Foreword

This text accompanies my course, POLI 381: Managing Data in Political Science.

Political science majors at UBC enter POLI 381 having completed a first-year course on research design (POLI 110) and (ideally) a third-year course on basic statistics (POLI 380). The motivation for designing this course flowed from a recognition that many students struggled to translate the theoretical lessons from POLI 110 and POLI 380 into the practical skills necessary for conducting independent research in their fourth-year seminars.

Students often faced challenges in locating data to test their theories, understanding file and variable formats, and managing data sets through operations such as merging and expansion. POLI 381 was designed to bridge the gap between the theory and practice of quantitative social science research, equipping students with the hands-on skills needed to navigate and analyze real-world data effectively.

This text reproduces and expands on my lecture slides, notes, and coding examples. The motivation for repackaging my course materials as a virtual textbook stemmed from a recognition that the course material is more effectively conveyed in a medium that allows students to move through the material at their own pace and to tinker with the data and code as they do so. I hope that students find it useful.

Christopher Kam, 2025

© Christopher Kam. This work is licensed under a Creative Commons license.




Introduction

This text serves as a practical guide for students navigating the use of data in political science research. Data is integral to understanding modern political phenomena, and this course is designed to provide students with the necessary tools to handle, manage, and analyze data effectively. By mastering these skills, students will be better prepared for conducting independent research and applying quantitative methods in academic or professional settings.

The course is designed on the expectation that students enter with some previous exposure to basic courses in: (a) research design and inference, and (b) statistics. Thus, students may have some experience running regressions in Stata or R, but they often lack the knowledge to build the underlying data sets required for regression analysis. Additionally, they may be unfamiliar with how to effectively present regression results or export them into documents. This course addresses these gaps, focusing on the essential data skills needed to support and communicate quantitative research.

The course offers extensive coding advice using Stata, R, and Google Sheets—tools commonly employed in academic research and data-driven industries—but it is not primarily a coding course or data science course. Instead, the primary goal is to impart the data management skills necessary to make students effective, independent social scientists, positioning them to teach themselves additional methods and tools as needed. In today’s environment, success depends on students knowing the right questions to ask—and even how to prompt tools like ChatGPT to assist them effectively.

The course addresses five key questions encountered in quantitative research:

  1. What data do I need?
  2. Where do I find those data?
  3. How do I organize those data?
  4. How do I describe and evaluate those data?
  5. How do I interpret and communicate results effectively?

These questions structure the overall workflow of the course, with later chapters covering data collection, transformation, and presentation in depth.

The course is structured around a data project, which allows students to build and analyze a data set to address a specific research question. Emphasis is placed on logical and efficient workflows; careful data construction, cleaning, and documentation; and clear presentation of results. As such, this text offers a step-by-step framework to help students meet the requirements of the course and develop foundational competence in data management and statistical analysis.

Throughout the text, particular attention is given to data organization, cleaning, merging, and harmonization. Students will engage with these processes using practical examples and hands-on exercises, ensuring that their understanding moves beyond theoretical concepts into applied contexts.

The text is organized into the following chapters, reflecting the main phases of the course. Chapters 1 and 2 provide students with the conceptual and contextual knowledge necessary to build a political science data set. Chapters 3-5 focus on preparing and managing a data set. This is where students will find the basic protocols and coding techniques necessary to construct, clean, and document a data set. Chapters 6-8 introduce a variety of “quality control” concepts and techniques that can be used to evaluate data quality. Chapters 9-11 focus on the presentation of multivariate data through tables, graphics, and regression results.

The book draws from many key works in the field, but two sources stand out. First, Hadley Wickham’s R for Data Science, which provided much of the inspiration for the structure and coverage of this book. My contribution is to apply many of Wickham’s coding lessons to political science data, to replicate coding lessons in Stata and (where applicable) Sheets, and more generally to offer an overtly disciplinary perspective on data science (e.g., in relating standards of documentation and presentation). Nonetheless, I highly recommend that students refer to R for Data Science throughout the course; it provides a more in-depth and comprehensive coverage of the capabilities of R than I provide below.

Second, much of the advice I offer on how to craft effective graphics comes straight out of two books by William S. Cleveland: Visualizing Data (Hobart Press, 1993) and The Elements of Graphing Data (Hobart Press, 1994). These are expensive books, but in my opinion the latter should be on the bookshelf of every social scientist and data analyst. My contribution lies in conveying Cleveland’s advice using political science examples, and supplying the R and Stata code to generate those examples.

By the end of the course, students will be able to identify and work with diverse data formats, use software for essential data operations, and present their findings effectively.




A Note on Code Chunks

I include many coding examples throughout the text. They appear in grey-shaded boxes like this:


 a code chunk
 

Every code chunk is written once in Stata, and then again in R. Sometimes, a third code chunk shows how to effect the same procedure in Sheets.

I have left most of the code chunks “inert” so that they do not run every time this R markdown file is “knit” (i.e., processed). It is easy to undo this in the markdown file by replacing, e.g., {r, eval=FALSE} with {r, eval=TRUE}.

I have also written the code chunks without any reference to file paths. That is, I write,


 df <- read.dta("stata_file.dta")
 
       or
 
 use stata_file.dta
 
       not
 
 df <- read.dta("C:/main/sub/stata_file.dta")
 
       or 
       
 use "C:\main\sub\stata_file.dta"

You can either copy the path to your data files into my code, or change your working directory. In Stata, type immediately on opening Stata:


cd "C:\main\sub\" 

with C:\main\sub altered to match your computer’s directory structure.

Similarly, in R, type immediately on starting R:

setwd("C:/main/sub/")



Part I – Building a Data Set



Chapter 1: Introduction to Data and Data Sets

Overview

This chapter establishes the basic terminology and framework for understanding data and data sets, providing a foundation for the practical work that follows in subsequent chapters. Mastery of these concepts is essential, as students will rely on them to explore data independently using online resources. Key topics include the definition and types of data, the structure of data sets, and an introduction to data manipulation through software. Throughout the chapter, exercises and corresponding examples are provided to ensure students gain practical experience.

Overview of Data: Statistical Concepts vs. Data Frames

This section distinguishes between the conceptual understanding of data (from statistics) and its practical organization in a data frame (used in software like R, Stata, or Excel). While statistics classes teach you how to interpret variables, distributions, and relationships, this course will focus on managing these variables effectively in a structured format. Think of this as moving from theory to practice. For example:

  • Statistical concept: A variable representing income might be treated as continuous or categorical based on the analysis.

  • Practical format: In a data frame, this variable could be stored as a numeric column or a string column, depending on how it was recorded or cleaned.

Understanding both perspectives will help you diagnose issues like mismatches between data types during merges or coding errors.
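To make this concrete, here is a minimal R sketch (the values and variable names are invented for illustration) of an income variable whose statistical concept is continuous but whose stored format is text:

```r
# Income recorded as text (e.g., entered with thousands separators)
df <- data.frame(id = 1:3,
                 income_text = c("50,000", "62,500", "48,750"),
                 stringsAsFactors = FALSE)

class(df$income_text)   # "character": the stored format, not the concept

# Strip the commas and convert to numeric before any analysis
df$income <- as.numeric(gsub(",", "", df$income_text))

class(df$income)        # "numeric"
mean(df$income)         # 53750
```

Checking a variable’s storage type with class() in R (or describe in Stata) before merging or modeling is a cheap safeguard against exactly this kind of mismatch.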

Variables: Linking Statistical Types to Data Formats

A variable’s statistical type (quantitative or qualitative) determines how it should be treated during analysis, but it also often dictates its format in a data frame.

The table below illustrates the link:

Statistical Type            Example                                    Common Data Formats in Software
Quantitative (Discrete)     Number of political parties                Numeric (integer)
Quantitative (Continuous)   GDP per capita                             Numeric (float or double)
Qualitative (Categorical)   Political system (democratic/autocratic)   String or factor (coded numerically if ordinal)
Qualitative (Text)          Speech transcripts or comments             String

Levels of Analysis: Microdata and Macrodata

Data can be analyzed at different levels depending on the research question and the structure of the data set. Understanding whether data represent individual-level observations or aggregated summaries is crucial for effective analysis and data management. Misidentifying these levels can lead to errors when merging, aggregating, or interpreting the data. We typically work with data collected at two levels of analysis:

  • Microdata are data that are observed at the individual level, such as survey responses from individual people or an individual voter’s decision to vote or abstain at a given election.

  • Macrodata are microdata that have been aggregated at higher levels, such as national GDP or country-level voter turnout. For this reason, macrodata are also called aggregate data.

In any particular study, some data may fall between the micro-level and the macro-level. Imagine, for example, that you are a policy analyst tasked with studying the effects of a curriculum change on student performance. The individual students’ test scores would be the microdata, and the average test score for the school would be the corresponding macrodata. However, the microdata—the individual students’ test scores—might also be aggregated at the level of the classroom or the grade level. Sometimes, data aggregated at such intermediate levels of analysis are called mesodata.

More generally, the data in this hypothetical school test score example exhibit what is known as a nested or hierarchical structure, with the microdata (the students’ test scores) nested inside mesodata aggregates (class averages), and those mesodata aggregates nested inside macrodata aggregates (the school average). Correctly defining the level of analysis of your variables is critical to avoid errors when combining data. In later chapters, you will see that mixing units of analysis—such as country-level and individual-level data—can produce inconsistent or unusable results during merges and appends.
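The micro-meso-macro nesting in the school example can be sketched in R (all scores invented for illustration):

```r
# Microdata: one test score per student, nested in classrooms within one school
scores <- data.frame(student   = 1:6,
                     classroom = c("A", "A", "A", "B", "B", "B"),
                     score     = c(70, 80, 90, 60, 70, 80))

# Mesodata: classroom-level averages
class_means <- aggregate(score ~ classroom, data = scores, FUN = mean)
class_means   # A = 80, B = 70

# Macrodata: the school-wide average
school_mean <- mean(scores$score)
school_mean   # 75
```

Note that each level is computed from the level below it; once data are aggregated, the individual-level variation cannot be recovered.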

Exercise
Are the following examples of microdata or macrodata?

1. Individual survey responses to a question about the country’s most important issue. Answer: Microdata
2. Average GDP per country. Answer: Macrodata
3. Individual-level voting records. Answer: Microdata
4. The number of days a particular interstate conflict lasts. Answer: Macrodata

Understanding Data Frames and Structures

A data frame is a common structure used to store tabular data, where rows represent cases and columns represent variables. For example:

Country Data
ID   Country   Year   GDP (trillions)   Population (millions)
1    USA       2020   27.4              329.50
2    Canada    2020   2.0               38.03
3    UK        2020   2.7               67.08

Here, each row represents a given country in a specific year. Thus, the observations in this data frame are country-years. The columns represent variables. For example, the entries in the column labeled “GDP” represent the GDP for each country in that year.

To be more precise, a variable is an object that takes on different values depending on the outcome of a random experiment. A specific outcome of that random experiment is called a realization. The canonical example of a random experiment is a coin flip: the (potential) outcome of a coin flip is a variable, and the result of a specific coin flip is one realization of that variable. Similarly, GDP in the table above is a variable, and the US’s GDP of 27.4 in 2020 is one realization of this variable.

An observation, in strict terms, encompasses all realizations associated with a given case. In this example, the observation for the US in 2020 includes not only its GDP of 27.4 but also its population of 329.50.

Although this terminology may seem nit-picky, it helps clarify differences between words that we often use interchangeably yet which convey subtly distinct meanings.

In particular:

  • Observation: A statistical entity consisting of the realizations of one or more variables for a given case.

  • Case: The unit being studied. In the example above, the cases are countries.

  • Record: A row in a data frame or, more generally, an entry in a database.

Observations and cases are conceptual objects related to our theory, while a record is a property of the data frame that we construct to test our theory. This distinction is important. For instance, simply duplicating records in a data frame does not double the information in that data frame, but it does alter its structure—which could have significant implications for any subsequent analysis.
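A quick R sketch of the duplication point (toy values):

```r
# Three records, one per case (country)
df <- data.frame(country = c("USA", "Canada", "UK"),
                 gdp     = c(27.4, 2.0, 2.7))

df2 <- rbind(df, df)   # duplicate every record

nrow(df2)           # 6 records...
nrow(unique(df2))   # ...but still only 3 unique observations

# The information is unchanged, but the structure is not: counts,
# standard errors, and merge results would all be affected
```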

These terms are fundamental to building effective cross-walks. For example, when merging country-year data sets, identifying the correct unit (e.g., country or individual) and ensuring variable compatibility is essential for creating accurate cross-walks.

Long vs. Wide Data Formats

Data frames come in different configurations or formats:

  • Long Format: Data where each row is an observation, often used in time-series or panel data.

  • Wide Format: Data where each row contains multiple variables representing different time points or conditions.

For example, here is an excerpt of a data set in long format:

Income Data
ID   Year   Income
1    2020   50000
1    2021   52000
2    2020   45000
2    2021   47000

Here is the same excerpt in wide format.

ID   Income 2020   Income 2021
1    50000         52000
2    45000         47000

When merging data, long-format data sets are often preferred for time-series analysis because each observation is captured as a unique record by time period and unit. Conversely, wide-format data sets are better suited for cross-sectional data where variables across time periods are stored in separate columns.

Importantly, we can convert between long and wide formats as needed, depending on the specific requirements of our analysis—a process we will explore later in this course. This conversion is loosely analogous to transposition in matrix algebra. Although we will not delve into complex mathematical explanations, it is important to understand this connection because many advanced texts and libraries describe data manipulations in terms of matrix operations.

Exercise: Transposing a data frame in R
#Use pivot_wider to convert long to wide

# R Code:
library(tidyr)
# Long format data set
data_long <- data.frame(ID = c(1, 1, 2, 2, 3, 3),
                        Year = c(2020, 2021, 2020, 2021, 2020, 2021),
                        Score = c(85, 90, 78, 80, 92, 94))
View(data_long)

# Convert to wide format
data_wide <- pivot_wider(data_long, names_from = Year, values_from = Score)
View(data_wide)
Exercise: Transposing a data frame in Stata
//Use reshape wide to convert long to wide
//Either run here after starting R’s Stata engine or cut & paste into a do-file:

clear  
input ID Year Score
1 2020 85
1 2021 90
2 2020 78
2 2021 80
3 2020 92
3 2021 94
end

describe
//Note the number of variables and observations

//reshape transposes the data frame from long to wide
reshape wide Score, i(ID) j(Year)

describe
//Note the number of variables and observations

//Convert wide back to long
reshape long

In the next exercise, we introduce the concept of a vector. In mathematical terms, a vector is an ordered collection of values that can represent quantities such as scores, measurements, or observations. A column vector is a vertical arrangement of values, while a row vector is a horizontal arrangement of values. Transposing a vector involves converting a column vector into a row vector or vice versa.

Exercise: Transposing a column vector in Sheets
#Steps for Transposing a Column Vector in Google Sheets:
    1. Create a column vector in Google Sheets containing the following values:
       85
       78
       92
       (These could represent individual scores, for example.)
    2. Select the range of cells containing the column vector.
    3. Copy the selected range.
    4. Select a new location in the sheet where you want to transpose the vector.
    5. Right-click and select Paste Special > Transpose to convert the column vector into a row vector.

After transposition, the values will be displayed horizontally, like this:

85   78   92

Understanding vectors and transposition is important because many data manipulation techniques—especially in time-series or multivariate contexts—can be explained using these concepts. As we proceed, you will see that operations on data frames often parallel certain mathematical operations on matrices and vectors.
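The same operation can be done in R, where t() transposes a one-column matrix (a column vector) into a one-row matrix (a row vector):

```r
# A column vector of three scores (same values as the Sheets exercise)
v <- matrix(c(85, 78, 92), ncol = 1)
dim(v)    # 3 1: three rows, one column

# Transpose the column vector into a row vector
tv <- t(v)
dim(tv)   # 1 3: one row, three columns
```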

Research Designs, Data Structures, and Subscripts

Different research designs tend to produce data of a given structure:

  1. In a cross-section, multiple cases are observed at one time period.

  2. In a time-series, one case is observed at multiple time periods.

  3. In a time-series-cross-section (TSCS), multiple cases are observed at multiple time periods.

  4. A panel is a subset of TSCS where the same cases are observed across multiple time periods.

Subscripts in statistical equations indicate the structure of the underlying data set:


\(\text{TURNOUT}_{it} = \alpha + \beta \text{CLOSE}_{it} + \delta \text{DENSITY}_{it}\)


This equation denotes a time-series-cross-section (TSCS) data set in which observations are indexed by case (\(i\)) and time (\(t\)). The indexing tells us that different realizations of these three variables are observed for each case (\(i\)) and time period (\(t\)).


\(\text{TURNOUT}_i = \alpha + \beta \text{CLOSE}_{i} + \delta \text{DENSITY}_{i}\)


This equation indicates a cross-sectional data set with observations indexed by case (\(i\)). The time subscript, (\(t\)), does not appear because all cases are observed at the same time.


\(\text{TURNOUT}_t = \alpha + \beta \text{CLOSE}_{t} + \delta \text{DENSITY}_{t}\)


This equation refers to a time-series data set with a single case observed over multiple time periods, with the time periods indexed by (\(t\)). The case subscript, (\(i\)), does not appear because only one case is observed.
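These structures can be illustrated as R data frames (all turnout figures below are invented):

```r
# Cross-section: many cases (i), one time period
cs <- data.frame(country = c("USA", "Canada", "UK"),
                 year    = 2020,
                 turnout = c(66, 62, 67))

# Time-series: one case, many time periods (t)
ts_us <- data.frame(country = "USA",
                    year    = 2016:2020,
                    turnout = c(60, 50, 53, 49, 66))

# TSCS: many cases observed at many time periods (i and t)
tscs <- expand.grid(country = c("USA", "Canada", "UK"),
                    year    = 2019:2020)
tscs$turnout <- c(49, 63, 65, 66, 62, 67)

# A well-formed TSCS data set has one record per (case, time) pair
anyDuplicated(tscs[, c("country", "year")])   # 0
nrow(tscs)                                    # 3 cases x 2 years = 6
```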

Exercise
Are the following data sets cross-sections, time-series, or TSCS?
1. Election turnout rates in multiple countries for a single year.
2. Economic indicators for a country over 20 years.
3. A survey conducted annually across different countries.

Practical Exercise on Data Exploration
1. Open a sample data set (provided in the course repository).
2. Identify the units of analysis and variables.
3. Determine whether the data set is in long or wide format.
4. Convert the data set from long to wide format in R, Stata, or Sheets.

Conclusion

This chapter provides the foundation for understanding data structures and key concepts. In subsequent chapters, you will apply this knowledge to locate, clean, merge, and append data sets effectively.




Chapter 2: Data Sources and Collection for Projects

Overview

This chapter focuses on identifying, locating, and understanding the types of data sources that students can use for their projects. Emphasis is placed on government agencies, international organizations, and major data archives, all of which offer ready-to-use data suitable for political science research. Understanding these sources is critical for developing a successful Time-Series Cross-Section (TSCS) data set for the assigned projects.

Research Questions for Data Projects

The following research questions form the basis of the assigned data projects. Each question requires the construction of a Time-Series Cross-Section (TSCS) data set, designed to capture variation across time and units (e.g., countries, individuals, or regions). Success in these projects depends on a solid understanding of how to manage covariates, align levels of analysis, and balance internal and external validity.


Comparative Politics Project

Are parties becoming more extreme, or are voters simply rewarding extremists more than before?

In many democracies, party positions appear to be drifting outward, yet it is unclear whether parties themselves are becoming more extreme or whether voters are increasingly supporting parties that were always outliers. One possibility is that mainstream parties have shifted their platforms in response to new issues such as immigration, climate change, or economic insecurity. Another is that voters, frustrated with economic stagnation or social change, are rewarding parties at the ideological fringes even if party positions have not moved much.

To get started, you will need to define what “issue extremism” means—issue-by-issue distance from the center, or aggregated ideological scale scores—and compare it with vote extremism, such as the vote share for far-left and far-right parties. You will first want to identify a consistent unit of analysis (party-year or country-year), and then assemble data on party positions and electoral results across time.

Suggested Variables
- Party left–right position; issue-specific positions (e.g., immigration, environment)
- Vote shares for all parties; total vote share of “extreme” parties
- Controls: party age, party family, economic conditions

Data Sources
- Chapel Hill Expert Survey (CHES) for party positions
- Manifesto Project (CMP/MARPOR) for issue salience and left–right scores
- ParlGov or national electoral commissions for vote shares
- OECD/World Bank for economic covariates



International Relations Project

Are wars becoming more or less deadly over time—and does the answer depend on the type of conflict?

Some scholars argue that wars are declining in lethality due to better international norms, peacekeeping, and humanitarian law. Others believe that the nature of violence is shifting—civil wars, insurgencies, and hybrid conflicts may produce fewer battlefield deaths but more sustained civilian harm. Deaths may be becoming more concentrated in a smaller number of highly destructive conflicts (e.g., Syria, Ukraine), even if the global average is falling.

To get started, you will need to decide how you want to measure “deadliness”—battle deaths, total deaths (battle + civilian), or death rates per capita—and whether to analyze all conflicts together or compare civil vs interstate wars separately. You will also need to merge conflict-level data with country-year covariates to explore correlates of lethality. Be explicit about terrorism: decide whether terrorism-related fatalities count as part of the war, or treat terrorism as a confound that may influence conflict dynamics.

Suggested Variables
- Battle deaths; total deaths; civilian deaths
- Conflict type (interstate, civil, internationalized civil)
- Duration; number of conflict actors; external intervention
- Controls: GDP per capita, population, geographic region; terrorism fatalities (confound or component)

Data Sources
- UCDP Battle-Related Deaths Dataset and UCDP Georeferenced Events (GED)
- Correlates of War (COW) for interstate conflict data
- START Global Terrorism Database (GTD) for terrorism-related fatalities (optional confound/component)
- World Bank/OECD for GDP, population, and other covariates



Public Policy Project

Are declining PISA scores driven more by changes in educational resources or by changes in student populations?

PISA scores have declined in many OECD countries, but the reasons are contested. One narrative argues that school systems have become overstretched—teachers, curricula, and institutions cannot keep pace with technological, demographic, and social pressures. Another suggests that population changes—migration, age structure, socioeconomic composition—may explain part of the decline independently of school quality. Understanding which trends matter most requires linking education outcomes to broader demographic and policy contexts.

To get started, you will need to define which PISA outcomes to track (reading, math, science; means or distributions) and then assemble country-year data on educational inputs and demographic change. The core task is to distinguish resource-side explanations (spending, teacher ratios, class size) from population-side explanations (migration flows, socioeconomic composition). Your unit of analysis will likely be the country-year, and you will merge PISA data with demographic and policy indicators.

Suggested Variables
- PISA test scores (mean scores by domain)
- Education inputs: spending per pupil, teacher–student ratios, teacher salaries
- Demographics: migration flows, foreign-born share, socioeconomic composition proxies
- Controls: GDP per capita, inequality, youth unemployment

Data Sources
- OECD PISA Data Explorer or downloadable PISA country-level files
- OECD Education at a Glance for teacher ratios, spending
- OECD Migration Database and Eurostat for demographic indicators
- World Bank for GDP, inequality, and other covariates


Each project involves working through specific stages of data set construction (outlined below), from identifying data sources to final documentation. Careful attention to the varying covariates, unit of analysis, and potential sources of bias will be critical to producing meaningful results.

Introduction to Research Questions and Data Needs

Locating data is not the first step in the research process. Instead, you should begin with:

  1. A research question or puzzle: This question should be amenable to empirical investigation. Even descriptive questions, such as “Are coalition governments more fragile now than in the past?”, require structured data to answer.

  2. A theory or model: This includes defining the key variables to be measured and understanding their relationships. Writing a mathematical model, including subscripts for units of analysis, can clarify the structure of the data.

Once these elements are established, you can move to the stage of identifying and collecting data.

Structured Data Repositories

Let’s begin our review of data sources by familiarizing ourselves with “structured” data repositories. These repositories are structured in the sense that a single organization collects and curates the data. Even so, there is a vast range of structured repositories. The list below is by no means comprehensive, but it should serve as a good starting point for your research projects.

Government Agencies

Government departments and central statistical agencies often provide extensive data on population, economy, health, and elections.

These are just 5 of thousands of government departments and central statistical agencies in countries around the world! These agencies are often the sources of the original data that are subsequently disseminated by international organizations such as the IMF or the UN.

Exercise
Visit one of the above government portals and download a data set on economic or social indicators. Explore its key variables using Google Sheets, R, or Stata.
Election Commissions

For political scientists, an important sub-type of government statistical agency is the national election commission. Election commissions maintain data related to elections, campaign finance, constituency boundaries, and more. However, these data sets often require additional formatting.

Exercise
Identify a data set on past elections, import it into your preferred software, and examine its structure and level of aggregation.

International Organizations

Many international organizations exist primarily to collect and disseminate data on global issues. These include:

Exercise
Explore one of the above databases, find a relevant data set, and document its key variables and time period.

Social Science Data Archives

Major social science data archives offer a wealth of survey data, opinion polls, and other social indicators. Some of the main social science data archives are:

These archives often include detailed metadata, making them highly useful for structured research projects.

Social Surveys

Many social surveys are in operation around the world. These often touch on politics and political attitudes, but they focus more generally on social attitudes, family and lifestyle choices, and education and employment patterns.

Closely related data are collected by:

Election Studies

A sub-class of social surveys of special interest to political scientists are the election studies that are now routinely conducted in many democracies around the world. You can often find election studies in repositories like the ICPSR, but many now have stand-alone sites:

And many more such studies exist!

Social Science Data Projects

There exist many repositories that are dedicated to specific political science projects, in both comparative politics and international relations. These include:

These data repositories are affiliated with projects on political regimes:

In addition, here are some sources for international relations data on conflict and trade flows:

Replication Warehouses

Replication warehouses contain data sets linked to specific academic papers, allowing students to replicate and verify research findings.

Exercise
Choose a data set from a replication warehouse and attempt to replicate the descriptive statistics of a published paper.

Ad Hoc Sources and Replication Datasets

While structured repositories are preferred, ad hoc sources can be valuable for specific projects:

Exercise
Explore one ad hoc repository and identify a data set that could be merged with data from a more structured source.

The above are just a small fraction of the data sources that now exist. Many others offer structured data at the press of a button. Still other databases offer unstructured data in the form of PDFs or the raw text of historical documents, e.g., the full text of the debates of the UK Parliament!

Guidelines for Selecting Reliable Data Sources

  1. Start with structured repositories: Use government, international organization databases, or social science repositories before turning to ad hoc sources. The former generally do a better job than the latter of providing reliable metadata, documentation, and ensuring the provenance of the data.

  2. Document your search process: Keep a log of where you searched, which keywords you used, and any issues encountered. This need not be complex: a brief list in a plain text document suffices. It is helpful to record the DOI of any data set that catches your eye, as this provides a permanent link to that resource.

  3. Evaluate data reliability: Prioritize sources with clear metadata and documentation. Always prefer original data to second-hand data.

Common Pitfalls in Data Collection

  1. Mixing Levels of Analysis: Students often mix levels of analysis or time scales within their models, resulting in data sets where some variables are essentially constants, and which therefore cannot explain variation. This is a function of one of the basic laws of probability: the covariance between any random variable, \(X\), and any constant, \(K\), is zero: \(\text{Cov}(X, K) = 0\)!

The problem can be illustrated in graphical terms. Let’s say that we are investigating the relationship between economic conditions and government approval. For each month of 2022, we collect data on a national government’s approval rate, and then merge these monthly approval rate data to annual data on the country’s per capita GDP. The figure below shows the relationship between the two variables: the government’s approval varies month-to-month, but GDP per capita is constant for the whole year. Changes in per capita GDP cannot possibly account for month-to-month variation in the government’s approval rate for the straightforward reason that per capita GDP is a constant in our data set!
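The zero-covariance rule is easy to verify directly. The sketch below (base R, with entirely hypothetical approval figures) pairs twelve monthly approval readings with a single annual GDP figure repeated across the year:

```r
# Hypothetical monthly approval ratings for one year
approval <- c(52, 50, 49, 51, 48, 47, 46, 48, 45, 44, 46, 43)

# One annual GDP-per-capita figure, repeated for all 12 months: a constant
gdp_pc <- rep(43000, 12)

# The covariance of any variable with a constant is exactly zero
cov(approval, gdp_pc)
```

However much approval moves month to month, the constant GDP series cannot covary with it.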

Students rarely produce data sets where the problem is this severe–but that tends merely to disguise the problem. Consider a second example. We hypothesize that economic inequality drives the popular demand for income equality: as economic inequality increases, so too does the demand for greater income equality. We gather data from the World Values Survey (WVS) and the World Inequality Database to test our hypothesis. From the former, we extract a variable that measures respondents’ agreement (on a 10-point scale) with the statement that, “Incomes should be made more equal.” This is our measure of the demand for income equality. From the latter, we obtain the proportion of national income going to the top-10 percent of earners in 2018, the year the WVS was conducted. This is our measure of economic inequality. We use these data to produce the following graph (limited to Australia, Brazil and Romania for clarity):

Our measure of economic inequality cannot account for why one Australian (Brazilian, Romanian) respondent strongly agrees that incomes should be made more equal, and why another Australian (Brazilian, Romanian) respondent strongly disagrees. This is because our measure of economic inequality is constant within each country. The difference in top-income shares can–at most–explain the difference in mean levels of agreement for income equality across these three countries.

By matching an individual-level response to a national-level economic statistic we have effectively reduced the variation in the answers of 4,700 respondents to just three observations, to wit: i) a mean level of agreement on the desirability of equal incomes and ii) a corresponding proportion of national income going to the top-10 percent of earners, for Australia, Brazil, and Romania, respectively. We can add more countries to our data set, but each additional country effectively adds just one more coordinate to the data set. Adding data from other years of the WVS is more effective, however, because each additional year effectively adds three more observations to the data set, one each for Australia, Brazil, and Romania.

  • Recommendation: To the extent possible, ensure that the dependent and independent variables are measured at the same level of aggregation or time scale. Expand the number of aggregate units of analysis (e.g., countries, years): if the data are mainly cross-sectional, try to add more time periods; if the data are mainly time-series, add more cross-sectional units.
  2. Adding Functionally Invariant Covariates: The example above can also be used to illustrate another problem that can crop up when people build data sets that mismatch levels of analysis. The temptation is sometimes to keep adding covariates in the vain hope of isolating all possible confounds. Not only is this an impossible task, it often results in the data set including many redundant variables.

When students consider a relationship like the one between economic inequality and the demand for income equality set out in the figure above, they frequently conclude: 1) that countries differ in many ways other than the share of income going to the top 10 percent, and for that reason, 2) it is imperative to control for these differences. All manner of aggregate statistics–such as per capita GDP or the annual unemployment rate–are then added to the data set.

This approach rests on two misconceptions:

  • First, it fails to recognize that many aggregate variables will be correlated. Moreover, the fewer the number of aggregate units or time periods in the data, the stronger these correlations will tend to be. Our hypothetical data set shows the limiting case: if we add the three countries’ 2018 per capita GDP to the data set, we find that we have constructed a data set where per capita GDP cannot vary independently of the top-10 percent’s income share. A simple table makes this clear: any observation with a per capita GDP of $10,296 has–and can only have–58 percent of national income going to the Top 10 percent. It is thus impossible to discern whether Brazilians’ demand for income equality is due to the country’s economic inequality, its per capita GDP, or its ineffable “Brazilian-ness”!
Country      % Income to Top 10%     Per Capita GDP (US$)
Australia    34                      63,200
Brazil       58                      10,296
Romania      40                      18,404
  • Second, we cannot hope to isolate all possible confounds by adding control variables because the most troubling confounds are not merely unobserved but unobservable! This is why social scientists search for quasi-experimental settings that proxy the effects of randomization: randomization ensures that unobservable confounds are distributed randomly between treatment and control groups such that they are uncorrelated with the variables in our models.

  • Solution: Do not continually add variables to your data set in the vain hope of controlling all possible confounds. In particular, avoid adding too many aggregate-level control variables because they will invariably be highly correlated and therefore redundant.

  3. Balancing Internal and External Validity: We face an unavoidable trade-off between internal validity, that is, the correct identification of cause and effect, and external validity, that is, the accurate estimation of population parameters on the basis of our sample. Expansive data sets will tend to enhance external validity in the sense that estimates generated from the data set will rest on large, representative samples. However, the broad scope of the data also leaves estimates more vulnerable to confounds from unobserved heterogeneity. By contrast, focused data sets that take advantage of quasi-experimental features (e.g., randomization or exogenous shocks) of unique cases enhance internal validity.

We can rarely say that a given data set strikes the optimal balance between internal and external validity. However, as the figure below illustrates, we can define what is sub-optimal on these criteria: the scope of the sample is too limited to provide external validity (e.g., it’s confined to a single country and era), the measures are blunt, perhaps because they are too highly aggregated, and the research design lacks any quasi-experimental features (e.g., randomization or exogenous shocks) to support causal identification.

Data Project Stages: From Identification to Documentation

This section outlines the core stages involved in building a transparent, reproducible, and analytically useful data set. These stages parallel the structure of the course project and provide a scaffold for effective data management, documentation, and statistical exploration.


Stage 1 — Conceptualization and Data Sources

The first step is to define your research question, identify your units of analysis, and outline the key concepts you intend to measure. Good conceptualization also requires thinking carefully about the data sources that can operationalize these concepts.


Key Concept — Unit of Analysis

The unit of analysis is the basic entity that each row in your data set represents. It is the level at which you observe, measure, and analyze data.

Common examples include:
- Individuals (e.g., survey respondents)
- Countries (e.g., country–year economic data)
- Electoral districts
- Firms, households, schools, or political parties

A clear and consistent unit of analysis is essential for merging data sets, constructing variables, and interpreting statistical relationships correctly.
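To make the idea concrete, the sketch below builds a small country-year data set in base R, where each row is one country observed in one year (the values are hypothetical):

```r
# Country-year data: the unit of analysis is the country-year,
# so each row pairs one country with one year
cy <- data.frame(
  country    = c("Canada", "Canada", "Brazil", "Brazil"),
  year       = c(2020, 2021, 2020, 2021),
  gdp_growth = c(-5.0, 4.6, -3.3, 5.0)   # hypothetical growth rates
)

cy
```

Any data set you merge into this one must also be identifiable at the country-year level.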


Identify and Evaluate Data Sources

  • Locate data sources that correspond to your theoretical framework and empirical needs.
  • Assess whether each source measures your key concepts accurately and at the appropriate unit of analysis (e.g., country-year, district-level, individual).
  • Record where each data source comes from, including DOIs or stable URLs for online datasets.

Explain Your Identification Strategy

Briefly describe how observable variables will allow you to answer your research question. This explanation should make clear:

  • the expected relationship between your dependent variable and core explanatory variables;
  • any assumptions underlying your empirical strategy;
  • which control variables are required and why.

A short statement of identification helps ensure that your data collection is disciplined and theoretically grounded from the outset.


Key Concept — Identification Strategy

An identification strategy explains how your data allow you to answer your research question. It links your theory, your variables, and your research design into a coherent argument about what causes what.

In political science, an identification strategy usually answers three questions:

  1. What relationship are you trying to measure?
    (e.g., Do stronger parties increase turnout?)

  2. Why do you believe the independent variable affects the dependent variable—rather than the other way around?
    (This is about direction, not equations.)

  3. What potential confounders must be accounted for?
    (e.g., socioeconomic conditions, institutional differences, or historical legacies that could influence both variables.)

A good identification strategy does not require advanced econometrics. It simply requires you to state clearly why the patterns in your data can be interpreted as evidence for (or against) your theoretical claim.


Stage 2 — Data Set Building & Documentation

This stage covers the conceptual and technical tasks involved in assembling, merging, harmonizing, and cleaning data. Because these tasks involve multiple moving parts, it is best to begin with a high-level “pseudo code”—a plain-language battle plan that outlines the logical order of operations before you begin writing software code.

Prepare for Merging: Units of Analysis and Crosswalks

  • Ensure compatible units of analysis. Data sources must share a common observational unit for proper merging. If they do not, identify variables that allow alignment (e.g., creating district-year identifiers, mapping local names to standard codes).
  • Define the crosswalk (merge key). A crosswalk maps observations across data sets—such as linking country names to ISO-3 codes or matching electoral districts across years. A well-defined merge key eliminates ambiguity and prevents mismatched observations.
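A minimal crosswalk can be as simple as a two-column lookup table. The sketch below (base R, with hypothetical names and codes) attaches ISO-3 codes to a name-keyed data set so that it can later be merged with ISO-coded sources:

```r
# A hypothetical crosswalk linking country names to ISO-3 codes
crosswalk <- data.frame(
  country_name = c("Canada", "Brazil", "South Korea"),
  iso3         = c("CAN", "BRA", "KOR")
)

# A data set that identifies countries only by name
survey <- data.frame(
  country_name = c("Canada", "Brazil", "South Korea"),
  support      = c(62, 48, 55)   # hypothetical values
)

# Attach the ISO-3 code via the crosswalk; iso3 then serves as the merge key
survey_keyed <- merge(survey, crosswalk, by = "country_name")
```

Once every contributing data set carries the same standardized key, the merge itself becomes routine.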

Merge and Harmonize Data

  • Use crosswalks and unique identifiers (e.g., country-year, individual ID) to merge data sets.
  • Resolve inconsistencies in naming conventions, temporal coverage, or missing identifiers.
  • Standardize variable names, coding schemes, and units of measurement (e.g., ensure all GDP values are in constant 2010 USD).

Key Concept — Merging vs. Harmonizing Data

Although often discussed together, merging and harmonizing are two distinct steps in data set construction.

Merging data means combining two or more data sets using a shared identifier (e.g., country–year, district ID, respondent ID).
- The goal is to bring different pieces of information into the same table.
- A successful merge requires a common unit of analysis and a reliable merge key (e.g., ISO codes, district numbers, unique IDs).

Harmonizing data happens before, during, and after merging. It makes variables compatible and comparable across sources.
- This includes standardizing variable names, coding schemes, categories, and units of measurement.
- Example: converting GDP figures into the same currency and year, aligning party names, or recoding survey responses into a consistent scale.

In short: merging combines data; harmonizing makes sure it actually fits together.


Clean the Data

  • Detect and remove duplicate entries, inconsistent formats, or out-of-range values.
  • Establish a consistent missing-data convention (e.g., NA or -99).
  • Note any systematic patterns of missingness that might reflect design decisions or data collection problems.
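These cleaning steps can be sketched in a few lines of base R (hypothetical data, using -99 as the placeholder convention mentioned above):

```r
# Hypothetical raw data with a -99 placeholder and one duplicated row
raw <- data.frame(
  id      = c(1, 2, 2, 3),
  turnout = c(67, -99, -99, 72)
)

# Adopt a single missing-data convention: recode -99 to NA
raw$turnout[raw$turnout == -99] <- NA

# Drop exact duplicate rows
clean <- raw[!duplicated(raw), ]

# Count missing values per variable to flag remaining gaps
colSums(is.na(clean))
```

Logging each of these decisions (what was recoded, which rows were dropped and why) is exactly the kind of early documentation described below.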

Document as You Go

Although full documentation will be completed later, early documentation is critical for transparency and reproducibility. At this stage, record:

  • the provenance of each data source and any secondary derivations;
  • definitions and coding conventions for key variables;
  • units of measurement and formats;
  • how missing values are coded and which variables exhibit systematic missingness.

This early documentation ensures you do not lose track of the decisions you yourself made during initial construction.


Stage 3 — Statistical Description

Once you have a clean and well-documented data set, you can begin exploring basic descriptive patterns. This stage lays the foundation for more advanced analysis.

Descriptive Statistics and Diagnostics

  • Compute means, medians, ranges, and distributions for your dependent variable and core explanatory variables.
  • Use tables and graphs to detect outliers, inconsistencies, and patterns not easily visible in raw data.
  • Examine correlations, conditional relationships, and potential nonlinearities.
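A minimal descriptive pass in R might look like the sketch below (hypothetical GDP growth figures, including one deliberate outlier):

```r
# Hypothetical GDP growth rates for six country-years; 15.0 is an outlier
gdp <- c(1.2, 2.5, 0.9, 3.1, 2.2, 15.0)

summary(gdp)   # min, quartiles, median, mean, max in one call
sd(gdp)        # spread; inflated here by the outlier
boxplot(gdp)   # a quick graphical check that flags the outlier
```

Note how the mean is pulled well above the median by a single extreme value; that gap is itself a useful diagnostic.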

Check for Missingness Patterns

  • Investigate whether missing values arise systematically—for example, because certain countries never report a statistic or because a merge failed.
  • Early detection of systematic missingness prevents analytical errors and helps you refine your data collection or coding.

Key Concept — Missingness

Missingness refers to any situation where a value in your data set is not observed. Instead of a number or category, the cell is blank, coded as NA, or marked with a placeholder like -99.

Political science data often contain missing values because:
- governments do not report certain statistics every year;
- survey respondents skip or refuse questions;
- historical records are incomplete;
- a merge failed to find a matching observation.

Missing data are important because they can:
- reduce the number of usable observations,
- bias results if the missingness follows a pattern (e.g., only poorer countries fail to report turnout),
- signal problems earlier in your data-building process (e.g., a failed merge or incorrect coding).

Early in a project, the goal is not to “fix” missingness but to recognize it, understand where it comes from, and decide whether it matters for your analysis.


Consider Transformations

  • Decide whether variables should be rescaled, logged, differenced, standardized, or otherwise transformed to improve interpretability or address nonlinear relationships.
  • Many of the tools for visualization and transformation are introduced in Chapters 7–10.
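Two of the most common transformations, sketched in base R with hypothetical (and deliberately skewed) income data:

```r
# Hypothetical, highly right-skewed income values
income <- c(1000, 5000, 20000, 100000)

# Log transformation compresses the long right tail
log_income <- log(income)

# Standardization rescales to mean 0 and standard deviation 1
z_income <- (income - mean(income)) / sd(income)
```

Which transformation (if any) is appropriate depends on your variable’s distribution and the interpretation you need; the point here is only the mechanics.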

Stage 4 — Final Write-Up

The final stage produces a polished, replicable report that integrates your empirical findings with the logic of your research question. This includes all data, code, and documentation required for full replication.

Write a Summary of the Data Set

Before finalizing your data dictionary, prepare a narrative summary that orients future users of the dataset. This summary should cover:

  • Scope of the data: temporal, geographic, and conceptual coverage.

  • Alternative or multiple measures: where contested concepts may require supplementary indicators or robustness checks.

  • Systematic missingness and selection effects: how missing values arise and what biases they may introduce.

  • Key descriptive patterns: major trends, clusters, and distributions observed during your statistical description stage.

  • Context for the data dictionary: a brief explanation of how to navigate the detailed documentation that follows.

The final report should demonstrate transparency, reproducibility, and a coherent link between your empirical patterns and your research question.


Key Concept — Reproducibility

Reproducibility means that another researcher could take your data and code and obtain the same results you report. It is about process, not correctness.

Reproducibility is different from:
- Validity: whether your measures capture the concepts you intend.
- Reliability: whether your measures would be consistent across repeated observations.

A study can be valid and reliable yet not reproducible if the data cleaning steps are undocumented or the code is incomplete.

To make your work reproducible, you must:
- document all data sources and transformations,
- use clear, executable code rather than manual edits,
- organize files so that someone else (including future you) can run the entire project from start to finish.

In short: validity asks “Did you measure the right thing?”,
reliability asks “Would you get the same measurement again?”,
and reproducibility asks “Could someone else follow your steps and get the same results?”





Chapter 3: Merging, Appending, and Cleaning Data Sets

This chapter introduces students to the essential processes of merging, appending, and cleaning data sets while highlighting both the conceptual and technical aspects of data manipulation. The goal is to provide students with the minimal coding tools needed for data set construction while emphasizing key conceptual tasks:

  • Distinguishing merges from appends: Understanding when to link data sets horizontally (merging) versus stacking observations vertically (appending).

  • Ensuring harmonization: Maintaining measurement and format consistency across variables to ensure accurate merges and comparisons.

  • Cleaning data: Recognizing that “garbage in” results in “garbage out,” making data cleaning a critical step in building reliable data sets.

Although practical coding examples are provided using tools like R, Stata, and Google Sheets, remember that this course is not primarily a coding or data science course. Instead, the focus is on understanding the purpose behind each manipulation and ensuring data integrity throughout the process.

Expectations: Basic Data Manipulation in Stata and R

Most students come into this course having taken one statistics course that has provided them with some exposure to a statistical software package, typically either R or Stata. Hence, I assume that students have a basic capacity to manipulate variables in the software of their choice. This includes the following basic commands:

Purpose                    Stata Command        R Command
Create new variable        generate             mutate
Conditional modification   if                   ifelse
Rename variable            rename               rename (from dplyr)
Label variables            label                attributes()
Filter rows                keep if / drop if    filter

These commands should allow students to create new variables, conditionally modify values, rename variables, apply labels, and filter cases.
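Tied together, these commands cover most day-to-day manipulation. A short dplyr sketch, using a hypothetical survey extract:

```r
library(dplyr)

# Hypothetical survey extract
svy <- data.frame(age = c(19, 34, 52, 61), vote = c(1, 0, 1, 1))

svy <- svy %>%
  mutate(adult30 = ifelse(age >= 30, 1, 0)) %>%  # create a variable conditionally
  rename(voted = vote) %>%                       # rename a variable
  filter(age < 60)                               # keep only respondents under 60
```

The Stata equivalents (generate with a conditional, rename, and keep if) follow the same logic one command at a time.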

Getting Help

You can always learn more about a given function by calling on the online documentation in Stata and R.

  • In Stata, help files can be accessed using:
  help <function_name>

Example:  

  help generate

In R, help files can be accessed using:

  ?<function_name>

Example:  

  ?mutate

While I do not assume familiarity with more advanced functions, students should be comfortable using these basic commands as a foundation for more complex data manipulation tasks. For a refresher on these basic commands, see these Stata cheat sheets and these TidyR and dplyr cheat sheets. This post on janitor is also helpful.

The Basics of Merging and Appending

Merging and appending are two common techniques for combining data sets, but they serve distinct purposes:

  • Merging refers to joining two or more data sets horizontally by matching rows based on common identifying variables, often referred to as the cross-walk or merge key. These keys can include variables such as country names, years, or individual IDs. The resulting data set contains variables from all merged data sets. Merging is typically used to add useful variables from one data set to another. In the table below, the variable, ID, is used as the cross-walk to merge the “primary” and “secondary” data sets.
Primary Data Set     Secondary Data Set     Merged Data Set
\(ID_i, y_i\)        \(ID_i, x_i\)          \(ID_i, y_i, x_i\)
\(ID_j, y_j\)        \(ID_j, x_j\)          \(ID_j, y_j, x_j\)
\(ID_k, y_k\)        \(ID_k, x_k\)          \(ID_k, y_k, x_k\)
  • Appending refers to stacking data sets vertically, combining observations from data sets with the same structure (i.e., same variables and formats). Appending is often used to add observations to a time series or panel. In the example below, the primary data set contributes observations for years 1-3, and the secondary data set contributes observations for years 4-6:
\(Year_t\)     Variable         Source Data Set
\(Year_1\)     \(y_1, x_1\)     Primary Data Set
\(Year_2\)     \(y_2, x_2\)     Primary Data Set
\(Year_3\)     \(y_3, x_3\)     Primary Data Set
\(Year_4\)     \(y_4, x_4\)     Secondary Data Set
\(Year_5\)     \(y_5, x_5\)     Secondary Data Set
\(Year_6\)     \(y_6, x_6\)     Secondary Data Set

Merging Data Sets

Key Steps in Merging Data Sets:

1. Identify at least two data sets:

One data set typically contains the dependent variable, Y, and the other a set of independent variable(s), X. It is common to label one of these data sets the “master”, “primary”, or “home” data set, and the other the “using”, “foreign”, or “secondary” data set.

2. Construct the Cross-Walk in All Data Sets

The success of a merge operation depends on constructing a cross-walk or merge key that correctly and uniquely pairs records across data sets. This means ensuring that \(Y_i\) from the primary data set is matched with \(X_i\) in the secondary data set, \(Y_j\) is matched with \(X_j\), and so on.

To construct a reliable cross-walk, you must address several key issues:

  • Unique Identification: The cross-walk must uniquely identify records in all contributing data sets. If a cross-walk does not uniquely define each record, merging will result in duplicate or mismatched observations.

  • Duplicate Observations: These arise when the cross-walk is incorrectly defined. For example, if you merge based only on country names, but one or more data sets contain multiple records for a given country (e.g., multiple years or different data sources), the merge will fail or produce unintended duplicates.

    • Both Stata and R provide utilities to check for duplicates based on combinations of variables (duplicates report in Stata; duplicated() in base R, or dplyr’s count() followed by filter(n > 1) in R).
  • Missing Data: If records exist in one data set but not the other, they may appear as missing after the merge or disappear entirely, depending on the type of merge or join operation used. Ensuring full coverage of cross-walk values in both data sets minimizes this issue.

    • Missing values in the cross-walk variables themselves can be misinterpreted as duplicates, leading to erroneous results. This is because Stata treats all cases with missing values (.) as identical in duplicate checks. Similarly, R treats NA values in the cross-walk as identical unless explicitly handled.

      Avoid this problem by ensuring that the cross-walk variables themselves do not contain missing values before merging. In Stata, use misstable summarize var1 var2 to assess missingness, where var1 is one variable of the cross-walk and var2 is a second cross-walk variable. In R, use summarise(across(c(var1, var2), ~sum(is.na(.)))) from dplyr or colSums(is.na(data[c("var1", "var2")])) in base R to check for missing values.

  • Standardizing Variable Names, Formats, and Measures: To avoid mismatches, you must harmonize variable names, data formats, and measurement units across data sets. For example:

    • If one data set encodes country names as full strings (e.g., Canada, Brazil), while another uses 3-digit ISO codes (124, 076), the merge will either fail or produce erroneous matches.
    • If date variables are labeled as "date" in one data set but "DATE" in another, the software will not recognize them as the same variable, causing merge errors.

For researchers working with cross-national data, the AppendIDs package in R (maintained by USC’s SPEC Lab) provides tools to standardize country names and identifiers across data sets with differing country-coding schemes (e.g., Correlates of War, Gleditsch-Ward). It is especially useful when constructing or harmonizing a cross-walk to ensure correct merges in cross-sectional and time-series contexts.

3. Iterate through merging and cleaning

Linking data sets often requires repeated adjustments to resolve mismatches or discrepancies. I strongly advise checking, cleaning, and harmonizing the variables in the merge key in both data sets prior to merging. Use functions in R and Stata to inspect and handle missing values.

Exercise
- The replication data for Besley & Burgess (2002) is stored in three data sets: Calamity.dta, Media.dta, and Politics.dta. 

- These data sets and their respective codebooks are on Canvas. 

- Merge Calamity.dta and Media.dta using the variables `State` and `Year` as the cross-walk.  

- Repeat the process in both Stata and R using the code below

Stata and R code to merge data sets

The Stata code to merge data sets is:

// Merging in Stata

//Load the "master" data set into Stata  
     
  use Calamity.dta, clear
      
//Merge the using data set to the master on a 1-to-1 basis
     
  merge 1:1 State Year using Media.dta
  
//Evaluate the merge via:
  
  tab _merge
    

Tip: Stata generates a variable _merge after a merge operation. You can tabulate that variable to evaluate the result of the merge; it will tell you how many records were matched and how many were not. You will need to drop _merge before merging the same data set again.

In R’s terminology merges are called “joins”. You can join Calamity.dta and Media.dta using the code below.

#joining in R

# Call haven library to support import of Stata data sets
  library(haven)

  Calamity <- read_dta("Calamity.dta")
  Media <- read_dta("Media.dta")
  
# Call dplyr library to support _join
      library(dplyr)

# A left_join(): Keeps all rows from the left data set. 
      merged_data <- left_join(Calamity, Media, by = c("State", "Year"))

I do not recommend using Sheets (or Excel etc.) to merge large data sets. However, for small data sets, you can follow these steps:

Merging in Google Sheets
- Step-by-step: 
    1. Use the VLOOKUP() function to match records based on the key variable. 
    2. Ensure consistency in formatting (dates, strings) across sheets. 
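For example, assuming the merge key sits in column A of both sheets and the value you want to import sits in the third column of Sheet2’s range A:C (hypothetical layout), the lookup formula would be:

```
=VLOOKUP(A2, Sheet2!$A:$C, 3, FALSE)
```

The final FALSE argument forces an exact match, which is what you want for merge keys; the default approximate match will silently pair near-misses.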

Types of Merges and Joins

Merges can take several forms, depending on the relationship between observations in the data sets being combined:

  • In Stata:
    - 1:1: Matches unique observations. 
    - m:1: Matches multiple observations in the master data set to a single observation in the using data set. 
    - 1:m: Matches a single observation in the master data set to multiple observations in the using data set.
    - m:m: Matches multiple observations in the master data set to multiple observations in the using data set.
  • In R:

    - left_join(): Keeps all rows from the left data set. 
    - right_join(): Keeps all rows from the right data set. 
    - inner_join(): Keeps only matching rows. 
    - full_join(): Keeps all rows from both data sets.

Stata’s merge m:m and R’s full_join options may seem odd. In practice, Stata’s own documentation cautions against merge m:m; when the goal is to generate all possible combinations of records from the contributing data sets, Stata’s joinby command is the safer tool, and in base R the same result can be obtained with expand.grid. A political science example that calls on this functionality is an international relations data set of country-to-country dyads.
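A directed-dyad frame can be sketched in base R with expand.grid (hypothetical ISO-3 codes):

```r
countries <- c("CAN", "BRA", "ROU")   # hypothetical ISO-3 codes

# All ordered country pairs
dyads <- expand.grid(country1 = countries, country2 = countries,
                     stringsAsFactors = FALSE)

# Drop self-pairs (a country paired with itself)
dyads <- dyads[dyads$country1 != dyads$country2, ]

nrow(dyads)   # 3 countries yield 3 * 2 = 6 directed dyads
```

Dyadic variables (trade flows, alliance ties, and so on) can then be merged onto this frame using the country pair as the crosswalk.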

Appending Data Sets

Appending involves stacking data sets vertically, combining observations from multiple data sets that share the same structure (i.e., same variables). While appends are relatively easy to perform using software, they can introduce significant issues if not handled carefully.

Variable Name Mismatches

Any mismatch between variable names across data sets can result in large patches of missing data in the combined data set. For example, if a variable is labeled X in data set 1 and x in data set 2, both variables will appear in the appended data set, but with missing values in rows corresponding to the other data set. The table below illustrates this problem. Ensuring consistency in variable names and formats is critical to avoiding this issue.

Appended Data Set
ID    X (from Data set 1)    x (from Data set 2)
1     10                     NA
2     15                     NA
3     20                     NA
4     NA                     25
5     NA                     30
6     NA                     35

Variable Harmonization

Variable harmonization is essential when appending data sets, as appending assumes that variables represent the same concept, measurement, and format across data sets. Issues arise when variables are labeled identically but differ in meaning or units—for example, one data set records income in thousands of dollars while another records it in whole dollars, or income may be derived from administrative records in one data set and survey responses in another. Harmonization requires careful attention to metadata, definitions, and scaling.
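A minimal sketch of harmonizing units before an append (hypothetical income data, with one source recording thousands of dollars):

```r
# Data set 1 records income in whole dollars
d1 <- data.frame(id = 1:3, income = c(42000, 51000, 38000))

# Data set 2 records income in thousands of dollars
d2 <- data.frame(id = 4:6, income = c(45.5, 60.0, 39.2))

# Harmonize the units BEFORE appending
d2$income <- d2$income * 1000

# Now the stacked column measures the same thing for every row
combined <- rbind(d1, d2)
```

Had we skipped the conversion, the appended income column would silently mix dollars and thousands of dollars, and no software check would catch it.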

Stata and R code to append data sets

Appending in Stata

// Appending in Stata

//Load the "master" data set into Stata  
     
  use master.dta, clear

//append the using data set  
  
  append using additional_data.dta, generate(newvar)
    
//the generate(newvar) option tags observations by contributing data set    

Appending in R

In R terminology, we bind data sets to stack one on top of another.

      library(dplyr)  # bind_rows() comes from dplyr
      combined_data <- bind_rows(dataset1, dataset2)

Other Important Operations: Transposing, Collapsing, Expanding

Beyond merging and appending, data manipulation often involves reshaping data to fit analytical needs. Three key operations that facilitate this are transposing, collapsing, and expanding. These transformations allow researchers to restructure data sets for improved readability, efficient computation, and compatibility with specific statistical estimators.

Transposing: Converting Rows into Columns and Vice Versa

Transposing a data frame restructures data by converting observations into variables (wide format) or variables into observations (long format). This is particularly useful for working with panel data, time series, and certain regression models. Indeed, my experience is that many statistical agencies provide their data in wide-format CSV files. See the UNDP Human Development Index Time Series. If we wanted to merge this to a TSCS data set in long format–Boix, Miller, & Rosato’s Democracy dataset, for example–we would first have to reshape one of the two data sets.

Reshaping in Stata: A wide-to-long example:


//Syntax (wide to long): reshape long stub, i(var1) j(var2)
  *stub example:income_ is stub of variables, income_1990, income_1991...
  *var1 identifies cross-sectional units
  *var2 is a new variable that Stata creates to index time
  
//Example: convert UNDP HDI time series to long format

   import delimited "HDR23-24_Composite_indices_complete_time_series.csv"

//Keep a subset of variables to simplify example:
  
    keep iso3 country hdi_*
    drop hdi_f* hdi_m* hdicode hdirank

//reshape data frame. Note: year was not previously a variable in the data set
  
  reshape long hdi_, i(iso3) j(year)

//return to wide via:
     
  reshape wide   

Reshaping in Stata: A long-to-wide example:

This example converts a long data set into a wide format data set.

//Syntax (long to wide): reshape wide stub, i(var1) j(var2)
  *stub: variable that receives var2 suffix to link it to a time period
  *var1 identifies cross-sectional units
  *var2 identifies time periods

//Example: convert BMR Regime data to wide format

//Load data

  clear
  use "BMR Regime Data.dta"

// Change a variable name to make it parallel to other variable names in file
  
  rename democracy democracy_

//Reshape the data using ccode to identify units:
    
  reshape wide democracy_, i(ccode) j(year)
  
//Stata returns an error; I investigate via:
    
  reshape error

//The problem: ccode constant even when country name changes!
    
//omit problematic variables and use country name to identify units:
    
    drop ccode abbreviation abbreviation_undp
  
// Now it works:
    
    reshape wide democracy_, i(country) j(year)

  //return to long via:
     
  reshape long

pivot_longer in R (tidyr): A wide-to-long example:

This code chunk uses tidyr’s pivot_longer function to repeat the wide-to-long transformation of the UN HDI data set.

#load the UNDP HDI time series  
  library(haven)
  
  HDI <- read.csv("HDR23-24_Composite_indices_complete_time_series.csv")  # header = TRUE (the default) reads in the column names

#call tidyr & dplyr libraries
  library(tidyr)
  library(dplyr)
  
#simplify the data set. This example uses some REGEX (see below)
  
  HDI <- HDI %>% select(iso3, country, matches("^hdi_\\d{4}$"))

HDI_long <- HDI %>%
  pivot_longer(cols = starts_with("hdi_"), 
               names_to = "year", 
               values_to = "hdi") %>%
  mutate(year = as.integer(sub("hdi_", "", year)))  # Convert 'hdi_YYYY' to numeric year
#running the above code without the line above is instructive; try it.
 
 View(HDI_long)

pivot_wider in R (tidyr): A long-to-wide example:

This code chunk repeats the BMR long-to-wide example in R (tidyr) using the pivot_wider function.

#Load the data and the libraries needed below
  library(haven)
  library(tidyr)
  library(dplyr)
  
  BMR_Regime_Data <- read_dta("BMR Regime Data.dta")
  
  View(BMR_Regime_Data)
  
#We know from the Stata example that some variables are problematic; drop them:
  
 BMR_Regime_Data <- BMR_Regime_Data %>% select(-ccode, -starts_with("abbreviation"))

#rename variable to ensure parallel variable names:
 
 BMR_Regime_Data <- BMR_Regime_Data %>% rename(democracy_ = democracy)

# Pivoting the dataframe from long to wide format
BMR_wide <- BMR_Regime_Data %>% pivot_wider(
  names_from = year,  # Column names will come from the "year" variable
  values_from = c(democracy_, democracy_trans, democracy_breakdowns, democracy_duration, democracy_omitteddata, democracy_femalesuffrage)  # Values will be taken from the "democracy" variables
)

View(BMR_wide)

Collapsing: Aggregating Data into Summary Statistics

Collapsing reduces a data set by aggregating multiple observations into summary statistics, such as means or totals. This is useful for creating summary tables or analyzing trends at a broader level.

Stata Collapse Example:

This code example collapses Besley & Burgess’s state-level data on flood damage to produce an annual time series of flood damage for India as a whole.


//Load Calamity.dta

  use "Calamity.dta"

//sum flood area, affected population, and economic cost for each year   

  collapse (sum) fafarea fafpop ftotv, by(year)

The sum option can be changed to obtain many other quantities: means, counts, quantiles, etc.

R Group_by Example (dplyr):

This repeats the previous Stata example in R, using dplyr's group_by function.

# Call haven library to support import of Stata data sets
  library(haven)

  Calamity <- read_dta("Calamity.dta")

# Call dplyr library to support group_by
      library(dplyr)
  

India_Floods <- Calamity %>%
  group_by(year) %>%
  summarise(
    sum_area = sum(fafarea, na.rm = TRUE),
    sum_pop = sum(fafpop, na.rm = TRUE),
    sum_total_v = sum(ftotv, na.rm = TRUE)
  )

This code groups the data by year and computes the sum of fafarea, fafpop, and ftotv, removing missing values. As with Stata’s collapse, dplyr’s summarise can compute a variety of quantities.
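Stripped of software specifics, collapsing is simply “partition the rows by a key, then reduce each group.” A sketch in Python using only the standard library (the state-year flood figures are invented; the variable names echo Calamity.dta):

```python
from collections import defaultdict

# Invented state-year flood records; variable names echo Calamity.dta
records = [
    {"state": "Assam", "year": 1970, "fafarea": 10.0},
    {"state": "Bihar", "year": 1970, "fafarea": 5.0},
    {"state": "Assam", "year": 1971, "fafarea": 7.5},
]

totals = defaultdict(float)   # analogue of collapse (sum)
counts = defaultdict(int)     # analogue of collapse (count)
for r in records:
    totals[r["year"]] += r["fafarea"]
    counts[r["year"]] += 1

means = {yr: totals[yr] / counts[yr] for yr in totals}   # analogue of (mean)
print(totals[1970], means[1970])  # 15.0 7.5
```

Whatever the software, the result has one row per value of the grouping key, and the original row-level detail is gone.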

Expanding: Generating New Observations from Existing Data

Expanding creates additional rows, often filling in missing time periods, interpolating values, or simulating counterfactuals. This is useful for panel data, time-series analysis, or simulations.

Stata Expand Example:

 //Load Calamity.dta

  clear
  use "Calamity.dta"
  
//Create 4 copies (one original, three duplicates) of each observation

   expand 4

//Create a variable, quarter, that records the quarter for each state-year
  
  by state year, sort: gen quarter = _n

This code chunk effectively quadruples the data set. Such an operation would be useful, for example, if we wanted to expand an annual time series so that we could merge quarterly data to it. Indeed, the last line creates a variable, quarter, using Stata’s system variable _n, which counts the row position within each state-year group. As a result, quarter takes values 1 through 4.

R Slice Example (dplyr):

This code repeats the previous Stata example, using dplyr's slice function.

# Call requisite libraries
library(haven)
library(dplyr)
library(tidyr)

# Load the data set
Calamity <- read_dta("Calamity.dta") 

# Expand each observation to four copies (original + 3 duplicates)
Calamity_expanded <- Calamity %>%
  slice(rep(1:n(), each = 4)) %>%  # Replicates each row 4 times
  group_by(state, year) %>%
  mutate(quarter = row_number())  # Generates quarters 1-4
  
# View the expanded data set
View(Calamity_expanded)

One can amend these code chunks to create as many duplicates as required. A note of caution: R also has an expand function (in tidyr)–but it is used to generate all combinations of variable values, not to duplicate existing records.
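Both the Stata and R chunks implement the same two-step pattern: duplicate each row k times, then number the copies within each group. A sketch in Python (the single annual record is invented):

```python
# One invented annual record to be expanded into quarters
annual = [{"state": "Assam", "year": 1970, "ftotv": 100}]

quarterly = []
for row in annual:
    for q in range(1, 5):        # four copies: the original plus three duplicates
        new_row = dict(row)      # copy, so the four rows stay independent
        new_row["quarter"] = q   # analogue of gen quarter = _n
        quarterly.append(new_row)

print(len(quarterly), [r["quarter"] for r in quarterly])  # 4 [1, 2, 3, 4]
```

Note that every value from the annual record is repeated verbatim in each quarterly copy; only the new counter distinguishes the rows.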

Cleaning and Standardizing Data Sets

Cleaning data involves detecting and fixing errors, inconsistencies, and formatting issues that could compromise analyses.

Handling Missing Values

Missing data can arise from various causes, such as non-responses, data entry errors, or differences in data collection methods. If missing values are not randomly distributed, they can introduce selection bias, potentially distorting statistical estimates and leading to invalid conclusions. Accordingly, any indication of a systematic pattern of missingness is of special concern. Such patterns can arise because of (i) errors in data manipulation, such as incorrect merges, appends, or transformations, or (ii) non-random selection, where certain observations are systematically excluded. Hence, one of the most important concerns when cleaning data is identifying and managing missingness patterns.

As a guiding principle: what’s excluded from your data set is as important as what’s included—but not nearly so visible! This underscores the importance of scrutinizing exclusions caused by missing data, transformation errors, or selection bias.

Managing missing values effectively involves several key steps:

  1. Identify missing values and different missing value codes: Determine what constitutes a missing value in the data set and whether different codes are used to represent various types of missingness (e.g., NA, -99, or . in Stata). For example, a survey question on household income might have two different missing codes: -99 to indicate that the respondent refused to answer and -88 to indicate that the respondent didn’t know the answer. These codes represent distinct types of missingness, each of which could have different implications for analysis. Consistent coding is essential to avoid errors during analysis.

  2. Ensure that missing values are documented in the data dictionary.

  • Special consideration for Stata: Be aware that Stata treats missing values (e.g., ., .a, .b, etc.) as large positive numbers, such that . < .a < .b < ... < .z. For example, replace x = 3 if x > 6 would mistakenly assign the value 3 to cells containing missing values unless they are explicitly excluded with an additional condition like & x != .
  3. Assess the patterning of missingness: Use summary statistics and diagnostic tools, such as Stata’s misstable summarize command or visual tools like missingness maps or bar plots, to detect where missing values occur and whether they exhibit systematic patterns. For a simple diagnostic, create a contingency table treating missingness as a category. In Stata, you can use:
           tab x y, miss

This will display a table where missing values are treated as a separate category, helping you visualize their relationship with other variables.

In R, you can achieve similar results using:

          table(is.na(x), is.na(y))

This command will tabulate missingness in both variables, providing a simple overview of their interdependence.

  4. Evaluate the implications of missingness: When missing values are concentrated in specific variables or observations, assess the potential for error or bias. First, check whether you have made an error in transforming or otherwise manipulating the variable. Second, consider the possibility of selection bias due to how the variable was conceptualized or the sample selected. For example, a variable that measures an incumbent’s tenure in office would be censored in jurisdictions with term limits. Naive comparisons to jurisdictions where incumbents do not face term limits would be misleading, and it would therefore be important to amend the data set and any accompanying documentation to flag cases that employ term limits.

    Fundamentally, we want to determine whether the missingness is ignorable (because it is as if at random) or non-ignorable (because it is not random). This distinction can never be established empirically because we never observe the counterfactual, fully observed distribution of the variable affected by missingness. However, we can, for example, assess whether observed variables, x and z, predict missingness in y. Such a test tells us whether y’s missingness is correlated with x and z. These tests are hardly dispositive, but they can signal potential problems and solutions.

  5. Determine appropriate handling strategies: Based on the nature and extent of missingness, decide on a course of action. Some options include the following:

  • Flag variables or cases with extensive missingness for further investigation. At the limit, you may want to remove the variable from the data set.

  • Exclude observations with missing values if they are minimally observed or randomly distributed. Be aware that this is often done automatically by statistical software programs. Analyses that are based only on fully observed cases are said to employ listwise deletion. Analyses that exclude observations on a case-by-case basis, depending on the specific variables missing in each analysis, are said to employ casewise or pairwise deletion. As a general rule, employ listwise deletion unless you are sure that the missingness meets assumptions for reliable imputation.

  • Impute missing values. Be aware that imputation of missing data is a complex topic, and it should not be done lightly. Incorrect imputation can introduce further bias and distort statistical results. Even if you do impute missing values, a copy of the data set that shows the original missingness should be archived alongside the imputed data set(s)! This allows third parties to test the implications of alternative imputation methods.
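Steps 1-3 above (identify the missing-value codes, recode them to a single missing marker, and tabulate the kinds of missingness) can be sketched as follows. The codes -99 (refused) and -88 (don’t know) come from the hypothetical survey example in step 1; the income values are invented:

```python
# Hypothetical survey incomes; -99 = refused, -88 = don't know (see step 1)
income = [52000, -99, 31000, -88, 47000, -99]

MISSING_CODES = {-99: "refused", -88: "dont_know"}

clean = []    # recoded series, with None as the single missing marker
reasons = {}  # tally of *why* each value is missing
for x in income:
    if x in MISSING_CODES:
        clean.append(None)
        reasons[MISSING_CODES[x]] = reasons.get(MISSING_CODES[x], 0) + 1
    else:
        clean.append(x)

print(reasons, clean.count(None))  # {'refused': 2, 'dont_know': 1} 3
```

Keeping the tally of reasons preserves the information in the distinct codes even after they are collapsed into a single missing marker, which is exactly what the data dictionary should record.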

Standardizing variable names

Ensure consistent naming conventions across data sets. Ch. 5 discusses variable naming protocols in greater detail. To ensure clarity and consistency in data sets, it is important to distinguish between variable labels and value labels:

  • Variable labels: Provide a descriptive name for a variable, making its meaning clear (e.g., “Parliamentary Cabinet’s Left-Right Position”).

  • Value labels: Describe the specific values within categorical variables (e.g., “1 = Male,” “0 = Female”).

Best practices include:

  • Use intuitive and/or systematic variable names to facilitate coding and manipulation.

  • Ensure consistency in capitalization and formatting (e.g., all lowercase) of strings, currency units, dates.

  • Avoid excessive reliance on acronyms unless widely understood (e.g., GDP).

  • Use value labels to clarify categorical data.

For example, a variable named LEFT_RIGHT might have a variable label “Parliamentary Cabinet’s Left-Right Position” and be coded with continuous values from -1 (most left-leaning) to 1 (most right-leaning).

Practical Advice for the Cleaning Process

Errors are inevitable during data cleaning, but with the right approach, they can be efficiently managed and corrected. Here are some practical tips:

  • Inspect documentation first: Always begin by thoroughly reviewing any available documentation, such as code books and metadata, to understand the structure, coding, and expected content of the data on which your data set is based.

  • Iterate and debug: Merging, appending, and harmonization often require multiple rounds of cleaning and adjustments. Be prepared to revisit earlier steps as new issues arise.

  • Preserve original data sets: Never overwrite your original files. Always maintain backups to ensure you can revert to the original data if needed.

  • Avoid command-line coding for major cleaning tasks: Coding interactively from the command line may seem faster, but it lacks reproducibility and makes it harder to recover from mistakes. A structured script not only saves time in the long run but also enhances reproducibility and transparency. Hence…

  • Write and execute well-commented batch code: Instead of coding directly from the command line, write your data manipulations in a Stata do-file or an R Markdown script. This approach ensures that you can easily return to the original data sets and restart the cleaning process if necessary. Well-commented code helps you track your logic, understand transformations, and identify errors quickly.

Summary

This chapter has covered five fundamental data manipulation operations:

  1. Merging (joining) data sets to integrate variables from one data set into another.

  2. Appending (binding) data sets to add observations from one data set to another.

  3. Transposing data sets to facilitate restructuring between long and wide formats.

  4. Collapsing data sets to generate aggregate statistics.

  5. Expanding data sets by duplicating records to enhance granularity or facilitate merges when variables are measured at different levels of aggregation or across different time series.

Along the way, we have also explored how to drop, rename, and sort variables. While Stata and R offer many additional data manipulation functions, the operations introduced here form the core toolkit for working with data frames effectively.

We ended with a discussion of cleaning practices. Merges, appends, and other complex data frame operations rarely work the first time. Follow up every such operation with an inspection of the data set. Pay special attention to systematic patterns of missingness. Get in the habit of writing batch code that attaches intuitive and systematic variable and value labels to the variables in the data set.

Practical Example: Harmonizing and Appending Canadian Election Studies

This example demonstrates how to harmonize and append data sets using Stata. The example focuses on income variables in the panel portion of the 1974-1980 Canadian Election Study (CES). These variables present two major obstacles to harmonization:

  1. The value of money changes over time: Income values from earlier years need to be adjusted for changes in purchasing power to make meaningful comparisons.

  2. Categorical income variables with different coding schemes: The surveys use different income bands for each year, meaning that the same numeric code (e.g., 1) may refer to vastly different income ranges depending on the survey year. For example, a 1 in the 1974 survey refers to incomes in the $0–2999 range, while a 1 in the 1979 survey may refer to a higher income category, such as $0–7500.

Without harmonization, combining these variables would produce an incoherent measure of income, leading to spurious analyses.
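The core harmonization arithmetic (map each category code to its year-specific band, take the band midpoint, and inflate to a common year) can be sketched as follows. The 1974 bands and the 6.1412 conversion factor come from the example below; the ceiling rule for the open-ended top bracket is the same simplifying assumption used there:

```python
# A few 1974 income bands: category code -> (floor, ceiling); top band has no ceiling
bands_1974 = {1: (0, 2999), 4: (7500, 9999), 8: (20000, None)}
INFLATE_1974_TO_2024 = 6.1412   # Statistics Canada conversion factor

def real_income(code, bands, factor):
    floor, ceiling = bands[code]
    if ceiling is None:
        ceiling = floor * 1.25   # simplifying assumption for open-ended top brackets
    return ((floor + ceiling) / 2) * factor

# Category 4 ($7500-9999 nominal) in real 2024 dollars:
print(round(real_income(4, bands_1974, INFLATE_1974_TO_2024)))  # 53732
```

Because the band-to-code mapping differs by survey year, the same function must be fed a different bands dictionary (and inflation factor) for each wave; that is the harmonization step the Stata and R code below carry out in full.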

The following Stata code restructures the 1974-1980 Canadian Election Study, harmonizes income variables, and adjusts them for inflation before appending them into a coherent data set.


//It is good practice to title and date your code

// Appending & Harmonizing data in Stata   
// Author: C. Kam                        
// Revised: 9 Feb 2025                   

// Load data

clear
use "CES_1974_1980.dta"

// Keep respondents who appear in both the 1974 and 1979 waves
numlabel, add

tab v4020 v4014, miss

keep if v4020 == 1 | v4014 == 1

//Check that CASEID is unique before duplicating records

duplicates list CASEID

//Create 2 records per respondent, label them by election year

expand 2, gen(duplicate)
gsort CASEID duplicate
by CASEID, sort: gen nth_record = _n
gen YEAR = 1974 if nth_record == 1
replace YEAR = 1979 if nth_record == 2

// Create harmonized income variable; I use a string for flexibility
gen INCOMEs = ""
replace INCOMEs = "0-2999" if v478 == 1 & YEAR == 1974
replace INCOMEs = "3000-4999" if v478 == 2 & YEAR == 1974
replace INCOMEs = "5000-7499" if v478 == 3 & YEAR == 1974
replace INCOMEs = "7500-9999" if v478 == 4 & YEAR == 1974
replace INCOMEs = "10000-14999" if v478 == 5 & YEAR == 1974
replace INCOMEs = "15000-16999" if v478 == 6 & YEAR == 1974
replace INCOMEs = "17000-19999" if v478 == 7 & YEAR == 1974
replace INCOMEs = "20000+" if v478 == 8 & YEAR == 1974

replace INCOMEs = "0-7500" if v1516 == 1 & YEAR == 1979
replace INCOMEs = "7500-9999" if v1516 == 2 & YEAR == 1979
replace INCOMEs = "10000-14999" if v1516 == 3 & YEAR == 1979
replace INCOMEs = "15000-16999" if v1516 == 4 & YEAR == 1979
replace INCOMEs = "17000-19999" if v1516 == 5 & YEAR == 1979
replace INCOMEs = "20000-24999" if v1516 == 6 & YEAR == 1979
replace INCOMEs = "25000-29999" if v1516 == 7 & YEAR == 1979
replace INCOMEs = "30000+" if v1516 == 8 & YEAR == 1979

// I use some string operations to extract the floor and ceiling of each income bracket.

split INCOMEs, p("-") gen(ends)
destring ends1, gen(floor) ignore("+")
destring ends2, gen(ceiling) ignore("+")

// Convert income categories to midpoints

gen midpoint = (floor + ceiling) / 2 if floor ~= . & ceiling ~= .

//Top brackets have no ceiling! For simplicity, assume ceiling = floor * 1.25
replace midpoint = (floor + (floor*1.25)) / 2 if floor ~= . & ceiling == .
  
// Inflation adjustment: convert nominal to real 2024 dollars
// I obtained conversion rates from Statistics Canada:

gen midpoint2024 = midpoint * 6.1412 if YEAR == 1974
replace midpoint2024 = midpoint * 4.0225 if YEAR == 1979

rename midpoint2024 INCOME
label var INCOME "Annual Income (HH-Real 2024$)"

//Relative incomes are another way to harmonize these data
// Generate income quintiles

xtile income_1974 = INCOME if YEAR == 1974, nq(5)
xtile income_1979 = INCOME if YEAR == 1979, nq(5)

gen INCOME_QNTL = income_1974 if YEAR == 1974
replace INCOME_QNTL = income_1979 if YEAR == 1979

// Final cleanup and variable ordering
order CASEID YEAR INCOME*, first
edit CASEID YEAR INCOME*

Here is R code that accomplishes the same objective.

# R Code for Harmonization and Appending:
# Loading required libraries
library(tidyverse)
library(janitor)
library(haven)

# Load the data sets
ces74to80 <- read_dta("CES_1974_1980.dta")

# Restructure the 1974-80 data by isolating respondents who answered both years
ces74to79 <- ces74to80 %>% 
  filter(v4020 == 1 | v4014 == 1) %>% 
  mutate(nth_record = row_number(), 
         YEAR = ifelse(nth_record %% 2 == 1, 1974, 1979))

# Respondents who answered neither income question are irrelevant
ces74to79  <- ces74to79  %>%
  filter(!(is.na(v478) & is.na(v1516)))

#I have to convert the income variables to factors with their associated labels:
ces74to79  <- ces74to79 %>%
  mutate(
   v478 = factor(v478, levels = attr(v478, "labels"), labels = names(attr(v478, "labels"))),
   v1516 = factor(v1516, levels = attr(v1516, "labels"), labels = names(attr(v1516, "labels")))
  )

# Harmonize income variables for 1974-1979 & check them
ces74to79 <- ces74to79 %>%
  mutate(
    income1974_str = as.character(v478),
    income1979_str = as.character(v1516)
  )
tabyl(ces74to79$income1974_str)
tabyl(ces74to79$income1979_str)

ces74to79 <- ces74to79 %>%
  mutate(
    income_str = ifelse(YEAR == 1974, income1974_str, income1979_str)
  )

#This will give you a peek at some common string manipulation functions in tidy R

ces74to79 <- ces74to79 %>%
  mutate(
    income_clean = str_remove_all(income_str, "\\$") %>%    # Remove all "$"
      str_remove_all(" PER YEAR") %>%                       # Remove " PER YEAR"
      str_replace("LESS THAN", "0 -") %>%                   # Replace "LESS THAN" with 0
      str_replace("OR MORE", "-") %>%                       # Replace "OR MORE" with -
      str_replace_all("\\s*-\\s*", "-")                     # Eliminate spaces around dash
  )

#Check result
tabyl(ces74to79$income_clean)

# Split income_clean into floor and ceiling of income bracket
ces74to79 <- ces74to79 %>%
  separate(income_clean, into = c("floor", "ceiling"), sep = "-", convert = TRUE, extra = "merge") 

# Convert income bands to numeric midpoints and adjust for inflation
ces74to79 <- ces74to79 %>%
  mutate(
    midpoint = case_when(
      !is.na(floor) & is.na(ceiling) ~ (floor + (floor * 1.25)) / 2,  # When only floor observed
      !is.na(floor) & !is.na(ceiling) ~ (floor + ceiling) / 2, # When floor & ceiling observed
       ),
    midpoint2024 = case_when(
      YEAR == 1974 ~ midpoint * 6.1412,
      YEAR == 1979 ~ midpoint * 4.0225
    )
  )

#Rename to INCOME
ces74to79 <- ces74to79 %>%
  rename(INCOME = midpoint2024) 

#Income quintiles, computed within each survey year (as in the Stata xtile example)
ces74to79 <- ces74to79 %>%
  group_by(YEAR) %>%
  mutate(INCOME_QNTL = ntile(INCOME, 5)) %>%  # Quintiles within 1974 and within 1979
  ungroup()

Key Takeaways:

  • Restructuring and harmonizing variables is essential before merging and appending data sets.

  • Inflating nominal values to real terms allows for meaningful time-series analysis.

  • Use code comments to document each harmonization step to ensure reproducibility and reliability.

Appendix: Manipulating Strings and Dates

Data come in many forms, including numeric, string, and date types. We have primarily worked with numeric data thus far, but it’s common to encounter strings and dates. Indeed, the last coding example employed a string variable and several string manipulations to harmonize incomes in the Canadian Election Study. These types of techniques are vital when dealing with data sets that contain text entries or time-based data. This appendix introduces a few basic but essential string and date operations.

What is a String?

A string is a sequence of characters that can include letters, numbers, and symbols. Strings are common in data sets representing names, addresses, identifiers, and more. However, unlike numeric data, strings require specific operations for manipulation, such as concatenation, substring extraction, and pattern matching.

Common String Operations

Below are several essential string operations commonly performed in Sheets, R, and Stata:

Substring Extraction

If a string has a regular structure–a postal code or web address, for example–then one can extract a portion of that string using the following commands:

  • Sheets/Excel: MID(text, start_position, num_characters) 
  • R: substr(text, start, stop) 
  • Stata: substr() 
Example: To extract the first three characters of a string in Google Sheets, use:
=LEFT(A1, 3)

String Concatenation

To concatenate is to join objects together end to end, as in a chain or series; here, we join multiple strings into one.

   • Sheets/Excel: CONCATENATE(value1, value2, ...) or A1 & B1 
   • R: paste(value1, value2, sep = " ") 
   • Stata: concat() 
Example: In Sheets, concatenate cells A1 and B1 with a space:
=A1 & " " & B1

Whitespace Trimming

Remove extra spaces from a string.

  • Sheets/Excel: TRIM(text) 
  • R: str_trim() 
  • Stata: trim() 
Example: In Sheets, trim excess whitespace:
=TRIM(A1)

Case Conversion

Convert string from lower to upper case, or vice versa.

• Sheets: UPPER(), LOWER() 
• R: tolower(), toupper() 
• Stata: lower(), upper()

String Splitting

Split a string into substrings based on a delimiter.

• Sheets: SPLIT(text, delimiter) 
• R: strsplit() 
• Stata: split()

Pattern Matching (Regex)

Regular expressions allow complex substring searches or replacements.

• Sheets: REGEXEXTRACT(), REGEXREPLACE() 
• R: grep(), gsub() 
• Stata: regexm(), regexr() 
Example: To extract Canadian postal codes in Sheets:

  =REGEXEXTRACT(A1, "[A-Z]\d[A-Z] \d[A-Z]\d")

where cell A1 contains the postal code.
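The same postal-code pattern ports directly to any regex engine. A sketch using Python’s re module (the address string is invented):

```python
import re

address = "456 Maple Ave, Vancouver, BC V6T 1Z1"   # invented address

# Canadian postal codes: letter-digit-letter, space, digit-letter-digit
match = re.search(r"[A-Z]\d[A-Z] \d[A-Z]\d", address)
print(match.group())  # V6T 1Z1
```

The pattern itself is identical to the Sheets version; only the surrounding function call changes from tool to tool.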

Other spreadsheet programs, such as Libre Office and Excel, do not necessarily share the same REGEX functions. However, searching REGEX in those programs will bring up similar functions.

A cheat sheet for REGEX in R is here

A cheat sheet for REGEX in Stata is here

You can test and debug your REGEX commands at this site or use an AI, like ChatGPT, to help configure your REGEX commands.

REGEX commands can be used to extract key words from text. This is a useful operation if one wishes to recast elements of a text into a categorical variable, for example. We will use the NELDA data set to illustrate how this can be done.

Practical Example: Using REGEX in Stata
//Load the NELDA data set
  clear
  use "id & q-wide_share.dta"
  
//Explore some variables; note that they are strings

  describe notes nelda1 nelda1notes
  
  list  notes 
  
  list nelda1 nelda1notes if nelda1notes != ""
  
// Use REGEX notation to isolate elections linked to coups
  
  list country year notes if regexm(notes, "coup")
  
// Use REGEX notation to isolate elections marred by violence
//   (ustrregexm is used because Stata's older regexm does not support \b or (?:...))
  
  list country year notes if ustrregexm(notes, "\b(violen(?:ce|t|tly)?|intimidat(?:es|ed|ing|ion)?|harass(?:es|ed|ing|ment)?)\b")   

// REGEX captures "violen[t]", "intimidat[ion]", "harass[ment]"

// Use above to create a dummy variable for violent elections
  
  gen violent = ustrregexm(notes, "\b(violen(?:ce|t|tly)?|intimidat(?:es|ed|ing|ion)?|harass(?:es|ed|ing|ment)?)\b")
  
  tabulate violent, miss  
Practical Example: Using REGEX in R

Here is the previous task replicated with stringr’s str_detect function. This code can be amended to make use of Base R’s grepl.

#Load necessary libraries

 library(haven)
 library(stringr) #part of the tidyverse--but loaded explicitly to highlight it
 library(dplyr)
 library(janitor)

#Upload NELDA data

  id_q_wide_share <- read_dta("id & q-wide_share.dta")

  View(id_q_wide_share)

# Isolate the elections marred by violence  
  violent_elections <- id_q_wide_share %>%
       filter(str_detect(notes,   "\\b(violen(?:ce|t|tly)?|intimidat(?:es|ed|ing|ion)?|harass(?:es|ed|ing|ment)?)\\b")) %>%
       select(country, year, notes)

  View(violent_elections)

# create a violent elections dummy  
id_q_wide_share <- id_q_wide_share %>%
  mutate(violent_dummy = ifelse(str_detect(notes, 
      "\\b(violen(?:ce|t|tly)?|intimidat(?:es|ed|ing|ion)?|harass(?:es|ed|ing|ment)?)\\b"), 1, 0)) 

# one-way table of the violent elections dummy
  tabyl(id_q_wide_share$violent_dummy)

How Dates are Encoded

Dates are critical for time-based data analysis, but they are encoded differently across software packages:

  • Google Sheets and Excel count days from December 30, 1899.

  • Stata dates are centered on January 1, 1960.

  • R uses January 1, 1970.

This difference in encoding implies that dates must often be converted or standardized when importing data between systems. R and Stata typically manage date conversions gracefully, but exporting R or Stata data to a spreadsheet often results in dates being converted to strings.
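Because each system counts days from a different origin, converting a day count between systems amounts to adding or subtracting a constant offset. Python’s datetime module can verify the offsets implied by the epochs above:

```python
from datetime import date

sheets_epoch = date(1899, 12, 30)   # Google Sheets serial 0
stata_epoch  = date(1960, 1, 1)     # Stata daily date 0
r_epoch      = date(1970, 1, 1)     # origin of R's Date class

# A Sheets serial converts to an R date number by subtracting this many days:
print((r_epoch - sheets_epoch).days)   # 25569
# ...and a Stata daily date converts to R by subtracting this many:
print((r_epoch - stata_epoch).days)    # 3653
```

If an imported date column is off by tens of thousands of days, an unconverted epoch offset like these is the first thing to suspect.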

Common Date Operations

Converting Strings to Dates

Convert text representations of dates into usable date formats.

• Sheets: DATEVALUE() 
• R: as.Date() 
• Stata: date() 
Example: In Sheets, convert "12/30/1989" to a date:
=DATEVALUE(A1)

Date Arithmetic

Perform arithmetic operations such as addition or subtraction of dates to calculate durations.

• Sheets: =A2 - A1 (difference in days) 
• R: difftime() 
• Stata: datediff() 
Example: In Sheets, to find the difference between December 31, 1990, and December 31, 1989:
=A2 - A1

Extracting Date Components

Retrieve specific parts of a date (year, month, day).

• Sheets: YEAR(), MONTH(), DAY() 
• R: format() 
• Stata: year(), month() 
Example: In Sheets, extract the year from a date:
=YEAR(A1)
Practical Example : Creating a Date from Numeric Data in Stata and R

The NELDA data set has two variables from which we can construct a date: year, a 4-digit variable, and mmdd, a variable that is either 3 or 4 digits depending on the month. Let’s convert these two variables to a more conventional date format, e.g. 31jan1989. This is a good example to work through because it requires some of the string commands that we examined previously.
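The crux of the task, zero-padding mmdd before parsing, is worth seeing in miniature. A sketch in Python (the example date, 31 January 1989, is taken from the text):

```python
from datetime import datetime

year, mmdd = 1989, 131        # 31 January 1989: the stored mmdd drops the leading zero

s = f"{mmdd:04d}{year}"       # zero-pad to mmddyyyy -> "01311989"
d = datetime.strptime(s, "%m%d%Y").date()
print(d.strftime("%d%b%Y").lower())   # 31jan1989
```

The Stata and R versions below do exactly this: concatenate, fix the missing leading zero, then hand the 8-character string to a date parser.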

My Stata code for this task is as follows:

//Load the data

  clear
  use "id & q-wide_share.dta"

//examination confirms year and mmdd are numeric formats
  
  list year mmdd if _n <= 5
  
  describe year mmdd
  
//Step 1: concatenate mmdd & year into a string  
      *Stata egen commands are handy!  
    
  egen mmddyyyy = concat(mmdd year)
    
//check length of string: should be 8 characters: mmddyyyy    

  gen length = strlen(mmddyyyy)
  
  tab length, miss

//investigate the issue
  
  gsort -mmddyyyy
  
  list mmddyyyy if length==7 & _n<=10
  
//Months before October have mddyyyy
  
//Step 2: Fix by concatenating a leading 0  
  
  replace mmddyyyy = "0" + mmddyyyy if length == 7
  
//Check if problem is fixed:
    
    list mmddyyyy if length==7 & _n<=10

//if it is fixed, lengths should all be 8 digits    

    replace length = strlen(mmddyyyy)
  
    tab length, miss  
  
//Step 3: Use Stata date() function to convert to date format  

    gen date = date(mmddyyyy, "MDY") if length==8

//Save for 13 idiosyncratic cases, the task is complete!    
   
   list if date != .
      
   list if date == .
   

Here is the conversion from numeric variables to dates in R:

library(haven)
library(dplyr)
library(lubridate) #You may need to install lubridate if this is your first ever use

# Load the NELDA data set 
id_q_wide_share <- read_dta("id & q-wide_share.dta")

# Step 1: Convert mmdd and year to a character string and ensure correct formatting
id_q_wide_share <- id_q_wide_share %>%
  mutate(
    mmdd = sprintf("%04d", mmdd),   # Ensure mmdd is 4 digits (e.g., 708 -> 0708)
    mmddyyyy = paste0(mmdd, year)   # Concatenate mmdd and year
  )

# Step 2: Convert to date format (MDY -> ddMMMyyyy)
id_q_wide_share <- id_q_wide_share %>%
  mutate(
    date = as.Date(mmddyyyy, format = "%m%d%Y"),  # Convert string to Date
    formatted_date = format(date, "%d%b%Y")       # Convert to ddMMMyyyy format
  )

# View result
head(id_q_wide_share$formatted_date)
Exercise: String manipulations in Sheets
Recreate this data set in Sheets

| ID | Name         | Address                       | Postal Code |
|----|--------------|-------------------------------|-------------|
| 1  | Alice Smith  | 123 Elm St, Toronto, ON       | M5V 2K4     |
| 2  | Bob Johnson  | 456 Maple Ave, Vancouver, BC  | V6T 1Z1     |
| 3  | Charlie Lee  | 789 Oak Blvd, Ottawa, ON      | K1P 5N2     |

Tasks:

1. Extract the postal codes using REGEXEXTRACT. 

2. Concatenate the name and address. 

3. Convert names to uppercase using UPPER(). 
Exercise: Date manipulation in Sheets
Create the following dates in Sheets:

| Event          | Date       |
|----------------|-------------|
| Independence   | 1989-12-31  |
| Reform         | 1990-12-31  |
| Treaty Signed  | 1992-12-31  |

Tasks:
1. Calculate the number of days between events. 
2. Extract the year from each date. 
3. Add 365 days to each date using date arithmetic. 
Exercise: Using string manipulations to build a cross-walk
Merge Media.dta to CIv2006.xls (Crime in India, Annual Series, 1954-1996).

Both data sets (and codebooks) are on Canvas.

You will have to construct a cross-walk to merge the two series.

The challenge is that CIv2006.xls only contains state acronyms. 

Use some string operations to construct the cross-walk.




Chapter 4: Data Documentation

Introduction: Why Documentation is a Core Element of Data Management

Data documentation is more than just an administrative task–it is critical to ensuring that data can be used correctly and effectively by others and even by the original creator in the future. Without comprehensive documentation, data sets can be misinterpreted, leading to flawed research or errors in replication. Proper documentation integrates various elements, including metadata, syntax files, and organizational tools like README files, to provide a full picture of the data set.

This chapter introduces the core components of data documentation: the data dictionary, the README file, syntax files for derived variables, and version control logs. These elements work together to ensure that data sets are well-structured, reproducible, and compatible with diverse research needs.

Why is Documentation Important?

Proper documentation is essential for two key reasons:

  1. Enabling others to understand and use the data set: Users of the data set need to know what is included, what is not, and how the data were collected, coded, and transformed. Without this knowledge, the data set may be misinterpreted, leading to flawed analyses or conclusions.

  2. Ensuring you remember key details about the data set: Over time, you may forget how variables were defined, where the data originated, and how transformations were performed. Clear documentation serves as a personal record that helps you understand your data set even months or years later.

What Should Be Documented?

Comprehensive documentation should describe the scope, content, and key details of the data set. Key components include:

1. Data Dictionary

A data dictionary provides detailed information about each variable in the data set. For each variable, document the following details:

  • Variable name: The variable name should be intuitive and/or systematic to facilitate efficient manipulation through software commands or loops (e.g., gdp_growth, population_density). Follow these guidelines:

    • Intuitive or systematic naming: Use names that reflect the content of the variable (e.g., LEFT_RIGHT for a left-right political alignment score) or follow systematic conventions for grouping related variables (e.g., income_2019, income_2020).
    • Uniform formatting: Maintain uniform conventions (e.g., all variable names in lowercase or uppercase).
    • Avoid excessive reliance on acronyms: Unless acronyms are widely understood (e.g., GDP, CPI), opt for more descriptive names.
    • Labels vs. value labels: Distinguish between variable labels (a descriptive label for the entire variable, like “Parliamentary Cabinet’s Left-Right Position”) and value labels (used in categorical variables to label individual values, like “1 = Left,” “2 = Center,” “3 = Right”).
  • Definition and measurement: Clearly describe what the variable represents and how it was measured (e.g., whether GDP is adjusted for inflation).

  • Coding information: Specify how categorical variables are coded (e.g., 1 = Yes, 0 = No).

  • Units of measurement: Indicate the units used (e.g., values reported in millions).

  • Variable format: Note the data type (e.g., numeric, string).

  • Source or provenance: Clearly document where the variable was obtained from, especially when using secondary sources like replication data sets.

  • Missing value codes: Identify how missing values are coded (e.g., -99 or NA) and explain their significance.

Example Data Dictionary

| Variable Name      | Description               | Measurement Units | Coding Information           | Variable Format | Missing Value Code | Source                                   | Derived Formula (if applicable) |
|--------------------|---------------------------|-------------------|------------------------------|-----------------|--------------------|------------------------------------------|---------------------------------|
| country            | Country name or code      | N/A               | ISO-3 codes (e.g., USA, CAN) | String          | NA                 | IMF World Economic Outlook (2023)        | N/A                             |
| year               | Year of observation       | Calendar year     | N/A                          | Numeric         | NA                 | N/A                                      | N/A                             |
| gdp_growth         | GDP growth rate           | Annual % change   | N/A                          | Numeric         | -99                | Jones & Smith (2022), replication data   | N/A                             |
| gov_spending       | Government spending (USD) | Millions          | N/A                          | Numeric         | -99                | World Bank Development Indicators (2020) | N/A                             |
| spending_gdp_ratio | Gov spending as % of GDP  | Percent of GDP    | N/A                          | Numeric         | -99                | Derived from original variables          | (gov_spending / gdp) * 100      |

2. README File

A README file serves as the manifest or guide for the entire data set and its supporting files. It should include:

  • Overview of the data set: A summary of its purpose, scope, and coverage (e.g., time periods and geographical regions). Any overview in a README file should be concise; just 1-2 sentences.

  • List of files included: A manifest specifying all associated files, such as CSV, Stata, or R data files, and any supplementary documents (e.g., syntax files).

  • Instructions for use: Guidance on how to load, merge, or analyze the data set, including any dependencies or software requirements. For example, if the files must be placed in a given directory, one would provide instructions to create that directory and a path to it.

  • Contact information: The author or creator’s contact details for questions about the data set.

Example README File

README: Government Spending and Economic Growth in OECD Countries, 1990-2020

This data set contains time-series cross-sectional data on government spending and economic growth for 25 OECD countries from 1990 to 2020.

The data set is available as:
  gov_spending.csv  (for general use)
  gov_spending.dta (for Stata users)

Instructions:  
  - Open gov_spending.csv using R or any spreadsheet software, or 
  - Load gov_spending.dta using Stata 15 or later. 

See transformation_script.do for details on derived variables.

3. Syntax or Do Files

Syntax or do files record the commands used to clean, transform, and generate derived variables. These files are essential for reproducibility and should include:

  • Scripts for derived variables: Document how derived variables (e.g., GDP per capita, government spending ratios, or government quality indices) were calculated.

  • Transformation and cleaning steps: Record steps like recoding, standardizing units, or reshaping data.

  • Analysis scripts: Optionally, include scripts for producing key summary statistics or graphs.

Example Syntax File

* Generate government spending as a percentage of GDP
generate spending_gdp_ratio = (gov_spending / gdp) * 100

* Recode missing values for gdp_growth
replace gdp_growth = . if gdp_growth == -99

4. Data Formats

The data set should be saved in multiple formats when appropriate. Common formats include:

  • CSV: Ensures compatibility across software platforms, but comes with limited metadata. Categorical variables will appear without value labels, for example.

  • Stata or R: Preserves variable types, labels, and metadata.

5. Version Control Logs

Maintain a version control log to track updates or changes to the data set over time. This log should include:

  • Dates and descriptions of changes (e.g., “Added variable for inflation-adjusted GDP”).
  • Information about added or removed variables.
  • Updates to derived variables or metadata.

Example Version Control Log

| Date       | Change Description                              | Author     |
|------------|-------------------------------------------------|------------|
| 2025-02-01 | Added derived variable spending_gdp_ratio       | C. Kam     |
| 2025-02-08 | Corrected coding issue for categorical variable | A. Student |

Both the README file and the version control log are best written and saved in plain text format (.txt). While plain text provides no formatting support (e.g., the capacity to write in different fonts or emphases), it can be opened and read by a very wide variety of operating systems and software packages. Tables, if they need to be constructed, can be rendered in markdown format (as is the table just above).

Summary

Comprehensive documentation ensures that data sets can be shared, understood, and reused effectively.




Part II – Diagnosing and Exploring Data




It’s tempting to run regressions as soon as your data set is complete: resist that urge! First, get familiar with your data.

Certainly, our end goal is often to estimate a regression of the form:

\[ Y = a + bX + e \]

However, while linear regression is robust, the adage “garbage in, garbage out” still applies. Before regressing \(Y\) on \(X\), scrutinize your data. Here’s a list of things to check for.

Key Checks for the Dependent Variable (\(Y\))

  • Censoring: Censored data have unknown values beyond a boundary—we know they exist but not their exact values. This can be by design (e.g., incomes reported as “>$200K”) or due to data limitations (e.g., pollution levels below a detectable threshold). When censoring occurs, we retain all observations but only have partial information on some values. Identifying whether censoring is systematic or random is critical for modeling.

  • Truncation & Clumping: Truncated data exclude observations beyond a boundary—we don’t just lose the value, we lose the entire case. For example, if a study only includes respondents earning under $100K, we have no data at all for higher earners. Truncation can cause clumping, where values accumulate near the cutoff.

  • Support & Gaps in Data: The range of \(Y\) is the set of all possible values it could take; the support of \(Y\) is the set of values it actually takes in the data. If \(Y\) has large gaps in its support—meaning certain values within the range never appear—it may indicate missing or underrepresented data. Ideally, \(Y\) has full support, meaning observations exist at the low end, high end, and throughout the middle of its range.

  • Limited Variation: If \(Y\) barely varies, there’s little to explain. For binary \(Y\), variance is \(p(1−p)\); extreme skew (e.g., 90% ones or zeros) is a red flag.

  • Linearity of Ordinal Data: If \(Y\) is an index of ordinal variables, does it behave like an interval variable? Check for smooth vs. abrupt changes.

  • Skewness & Re-scaling: If \(Y\) is highly skewed, consider whether it should be transformed (e.g., log or square root transformation) to facilitate interpretation and comparison across units. However, be mindful that some transformations introduce missingness–e.g., log transformations result in missing values for \(Y \leq 0\), which may require adjustments.
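Several of these checks on \(Y\) are easy to script. The sketch below uses simulated data (not any data set from the course) to show how to inspect the support of \(Y\), spot clumping at zero, and see how a log transformation introduces missingness for non-positive values:

```r
# Illustrative checks on a simulated outcome variable
set.seed(42)
y <- c(rexp(500, rate = 0.5), rep(0, 50))  # right-skewed, with a clump of zeros

summary(y)        # range and quartiles: a first look at Y's support
table(y == 0)     # 50 observations clump at zero

# A log transform tames skew, but sets y <= 0 to NA:
log_y <- ifelse(y > 0, log(y), NA)
sum(is.na(log_y)) # the 50 zero observations are now missing
```

If you applied such a transformation in a real analysis, you would record the resulting missingness in your data documentation.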

Key Checks for the Independent Variable (\(X\))

  • Variance Matters: \(X\) must vary; otherwise, it’s not a control variable—it’s a constant.

  • Clumping & Missingness: The consequences of clumping and missingness in \(X\) are less severe than in \(Y\) but can still distort analysis. Ideally, \(X\), like \(Y\), has full support, meaning observations exist at the low end, high end, and throughout the middle of its range. This helps us to avoid computing marginal effects or making linear predictions beyond \(X\)’s observed range.

  • Measurement Reliability: Watch for erratic changes in \(X\), especially in time-series data. Unexpected fluctuations suggest errors.

  • Collinearity: If multiple IVs are highly correlated, one or more may be redundant.
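A quick way to screen for these problems is to compute standard deviations and pairwise correlations before modeling. This sketch again uses simulated variables rather than course data:

```r
# Screening regressors for (lack of) variation and collinearity
set.seed(1)
x1 <- rnorm(200)
x2 <- 0.95 * x1 + rnorm(200, sd = 0.1)  # nearly collinear with x1
x3 <- rep(1, 200)                       # a constant masquerading as a variable

sapply(list(x1 = x1, x2 = x2, x3 = x3), sd)  # x3 has sd 0: no variation
cor(x1, x2)                                  # close to 1: likely redundant
```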

The next five chapters introduce techniques to help you execute these data checks. Note well, however: this check list is not meant to be applied mechanically. Some issues pose greater problems for certain data sets than others, and the severity of these issues depends on the nature of the data and the research question. Judgment is required to determine which checks warrant the most attention in any given analysis.




Chapter 5: Tables for Diagnosis, Description, and Analysis

Tables play a central role in data management and analysis, serving as key diagnostic devices to understand, clean, and verify data integrity. (You may have noticed how frequently I generate tables in my code chunks to see if a particular command worked as intended.) Whether you are looking for contradictions, missing values, or outliers, tables provide an efficient mechanism for identifying and addressing data quality issues. Tables also provide a convenient and compact way to describe aspects of discrete variables, and serve as the basis of many statistical analyses.

One-Way, Two-Way, and Three-way Tables

Tables come in different forms, depending on the purpose of your analysis:

  • One-way tables display aspects of a single variable, such as counts, means, or standard deviations.

  • Two-way tables (or contingency tables) show how two variables covary, with cell entries often representing frequencies or percentages.

  • Three-way tables show how two variables covary conditional on the value of a third variable. Three-way tables are closely related to the conditioning plots that we examine in Chapter 8.

In theory, you can extend tables into N-way configurations. However, the diagnostic process starts with one- and two-way tables.

Tables for Diagnosing Data Issues

Tables are indispensable for diagnosing common data problems such as:

  1. Detecting logical impossibilities and contradictions: Suppose you are working with demographic data. A simple one-way table of birth years might reveal an impossible value, such as an individual recorded with a negative age. This might suggest an error in the coding of missing values (which are commonly set to be negative).

  2. Identifying patterns of missingness: A two-way table can show whether missing values are clustered by certain groups or conditions. This can help determine if the missingness is random or systematic. Failing to harmonize variable labels before appending data sets, for example, can lead to systematic patterns of missingness. A table of missingness by contributing data set can spot this sort of error.

  3. Locating outliers: Summarizing discrete variables through one-way tables can quickly highlight extreme or odd values. Tabulating respondents’ ages, for example, might show a person aged 108. This is not logically impossible–but it is highly unlikely! Such an observation should be flagged and examined directly.

Once you’ve used tables to identify inconsistencies or missing data, you can clean the data set by addressing those issues. This might involve correcting outliers, filling in missing values, or removing erroneous records.

Diagnostic Tables in Stata

I use Stata’s tab command to quickly generate diagnostic contingency tables that spot contradictory cases and coding errors. Add the missing option to tab to obtain information on missingness in a variable or variables.

//Load Secret Ballot data set

clear
use "Secret Ballot.22.2.2025.dta"

//the codebook says these indicator variables should have 3 values: 0, 1, .

tab yes1839, miss
tab yes1840, miss
Yes vote at |
          t |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,481       18.55       18.55
          1 |      4,883       16.53       35.08
          6 |        235        0.80       35.88
          . |     18,945       64.12      100.00
------------+-----------------------------------
      Total |     29,544      100.00

Something is awry in the yes1840 variable: 235 records are coded 6–and this should not be the case. These records should be examined.

The tab command is easily extended to two-way tables. The row option provides row percentages.


tab  SBposition68 ProAnti_pre1868 , miss row 

Loops to enhance data diagnosis and cleaning

If you examine Secret Ballot.22.2.2025.dta you will see that there are many indicator variables like yes1839 and yes1840. How can one check through them expeditiously? Here are some options.

//1. summary table of statistics, using wildcard(*) suffix to variable name
summarize yes18* 

//But summarize ignores missing values.

//2. a log and a loop:

  log using "data_check.log"

  foreach t in 1833 1835 1836 1837 {
    tab yes`t', miss
  }

  log close

//This code generates a one-way table for each variable, and posts the tables to a text file, "data_check.log". 

Stata has a dedicated table-based utility, misstable, to track missingness.


misstable summarize LF_1865 cast_1865 pctLF_1865
    Variable |     Obs=.     Obs>.     Obs<.   Unique values        Min        Max
-------------+--------------------------------------------------------------------
     LF_1865 |       478               29,066            152        -99       4660
   cast_1865 |       478               29,066            203        -99      50595

(Recall that . is Stata’s missing-value indicator, and that . is stored internally as a very large number. Obs<. therefore counts observed values, i.e., values less than this very large number.)

The output shows that LF_1865 and cast_1865 both contain 478 missing values, but pctLF_1865 has none–hence it does not appear in the misstable output. This is odd because the code book tells me that pctLF_1865 = (LF_1865/cast_1865)*100.

The advantage of misstable is that it shows the coincidence of missing values without presenting the range of observed values. This means that misstable output is concise. By contrast, a tabulation of a continuous variable may be too verbose to be helpful.

Diagnostic Tables in R

R’s tabyl() command from the janitor library is an efficient tool for producing simple tables. tabyl() automatically calculates frequencies and proportions, making it ideal for quick diagnostics. This code chunk repeats the Secret Ballot examples from above using tabyl().

library(janitor)
library(dplyr)
library(haven)
Ballot <- read_dta("Secret Ballot.22.2.2025.dta")


  tabyl(Ballot$yes1840) # A 1-way tabyl()
  
  tabyl(Ballot, SBposition68, ProAnti_pre1868) # A 2-way tabyl()

Combining the is.na() function with tabyl() allows us to assess the coincidence of missingness between two variables.

#A missing values contingency table
library(janitor)
library(dplyr)

#A 3-way missing values contingency table, conditioned on values of contested_1865
Ballot %>%
  mutate(
   pctLF_1865_missing = is.na(pctLF_1865),
    LF_1865_missing = is.na(LF_1865)
  ) %>%
  tabyl(pctLF_1865_missing, LF_1865_missing, contested_1865) 


# Alternatively, count() produces the same cross-tabulation in long form:
Ballot %>% count(pctLF_1865_missing = is.na(pctLF_1865), LF_1865_missing = is.na(LF_1865))

Advice for Using Tables to Clean Data

  • Start simple: Begin with one-way tables to get a general overview.

  • Look for patterns: Use two-way tables to see if errors or missing data are associated with particular subgroups.

Marginal and Joint Distributions

Understanding the marginal and joint distributions of a table is essential for interpreting relationships between variables.

  • Marginal distributions provide summaries for individual variables by summing or averaging over the table’s rows or columns. For example, the marginal distribution of a variable in a contingency table shows the overall frequency or proportion associated with each category.

  • Joint distributions reveal how combinations of variable values co-occur within the table. Each cell of a two-way table shows the joint distribution of the corresponding values of the two variables.

Interpreting marginal and joint distributions together helps identify patterns such as correlations or dependencies between variables.
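In base R, table(), addmargins(), and prop.table() make the distinction concrete. The party and position values below are hypothetical, not drawn from the Secret Ballot data:

```r
# Joint and marginal distributions of a two-way table
party    <- c("Con", "Con", "Con", "Lib", "Lib", "Lib", "Lib")
position <- c("Anti", "Anti", "Pro", "Pro", "Pro", "Anti", "Pro")

joint <- table(party, position)  # joint distribution: counts per cell
addmargins(joint)                # appends row and column marginals ("Sum")
prop.table(joint, margin = 1)    # row percentages (as proportions)
```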

Example: Consider the following table showing MPs’ support or opposition to the adoption of the secret ballot on motions presented to the House of Commons, 1834-1868. (The Stata code that produced the table follows later in the chapter.)

Table 5.1: MPs' Positions on the Adoption of the Secret Ballot in the UK House of Commons, 1834-1868.

  Position on Secret Ballot

| 2-Party Affiliation | Anti   | Pro    | Total   |
|---------------------|--------|--------|---------|
| Conservative        | 924    | 19     | 943     |
|                     | (98.0) | (2.0)  | (100.0) |
| Liberal             | 350    | 736    | 1,086   |
|                     | (32.2) | (67.8) | (100.0) |
| Total               | 1,274  | 755    | 2,029   |
|                     | (62.8) | (37.2) |         |

Figures in parentheses are row percentages.

Notes: MPs' positions reflect their recorded votes on 22 motions on the secret ballot put before the House of Commons between 1834-1870. MPs who voted at least twice for (against) the secret ballot are recorded as pro- (anti). The 696 MPs who served in the House when these motions were presented, but who did not cast a vote on these motions, are omitted from the table. Party affiliation is assigned on the basis of Rallings and Thrasher (2007).

  • The marginal distribution is obtained by summing across rows or down columns. Thus the row marginals tell us that 943 Conservative MPs and 1,086 Liberal MPs voted on the secret ballot motions submitted to the House between 1834 and 1868. The column marginals tell us that 1,274 MPs were anti-secret ballot, and 755 MPs were pro-secret ballot.

  • The joint distribution is shown by the cells in the interior of the table. For instance, 924 Conservative MPs–98.0% of all Conservatives who voted on these motions–opposed the secret ballot. By contrast, 736 Liberal MPs–67.8% of all Liberals who voted on these motions–supported the secret ballot. However, a substantial minority of 350 Liberal MPs were anti-secret ballot. In this manner, the joint distribution reveals the distribution of pro- and anti-ballot sentiment across and within each party: we see that the Conservatives were almost uniformly against the secret ballot, whereas the Liberals, while generally supportive of the secret ballot, were internally split on the issue.

Presenting Tables to Describe Data

Let’s revisit the table above on British MPs’ votes regarding the secret ballot to consider the statistical narrative that should accompany any table presented in the main text of a paper.

  • Tables do not explain themselves. A table without context is just numbers on a page. Always guide the reader by providing two essential forms of context:

    1. Explain how to read the table. Clearly define what the rows, columns, and cells represent—whether they contain counts, averages, or other measures. Ensure that the reader understands what the table is displaying before drawing any conclusions.

    2. Interpret the table’s significance. Highlight key patterns or contrasts in the data and explain how they relate to the broader arguments of the paper. What does the table reveal? Does it support or challenge an existing interpretation?

  • Give the table an informative title. A clear, specific title allows the reader to grasp its purpose at a glance.

  • Use a note(s) to provide additional explanatory information. A brief note accompanying the table can clarify any complexities in the data. For example, my original note explained how MPs were categorized as pro- or anti-ballot, that those who did not participate in any motions on the secret ballot were excluded, and how party affiliation was determined. Such notes are best positioned directly below the table so that the reader can see that the note applies to the table rather than to any other text or figures on the page.

  • Keep explanation and interpretation distinct.

    • Explanation should be factual and objective—a neutral account of what the table presents and how to read its contents.

    • Interpretation should come only after explanation and should focus on why the data matter. This is where you connect the table’s content to the broader argument of your paper. Does the data support a hypothesis? Does it challenge a conventional understanding?

Do not intermix explanation and interpretation; doing so confuses the reader and obscures the logic of your argument. By first explaining the table and then interpreting its significance, you ensure clarity and strengthen your analytical narrative.

Table-building Rules

Tables are easier to explain and interpret when they are properly built. Schwabish (2021, 344) offers some concrete guidelines for constructing tables, several of which I repeat here:

A minimalist design ensures that the aesthetic features of the table do not distract the reader from the contents of the table. To that end:

  • Limit the use of guidelines and dividers.

  • Use white-space rather than guidelines to separate columns and rows

  • Remove units and denominations that would otherwise repeat in every cell. For example, if the cells represent percentages, you should not enter 14%, 87%, 61%, etc.; rather, you should write a note to the table that tells the reader that cell contents are percentages.

  • Do not clutter table cells with unnecessary precision. Political science is not like physics or medicine in that our data rarely require more than two significant digits.

In addition, some basic formatting strategies help to make tables easier to interpret:

  • Right-align numerical entries in a font that ensures that decimals and thousands commas line up. This makes it easier for the reader to identify changes in magnitude between cells at a glance.

  • Offset row and column headers from the cell contents. Sometimes this can be accomplished simply by left- or center-aligning header text. In other instances, you may want to add a blank row or column to the table to create this separation.

Tables for Analysis

Marginal Distributions and Differences-in-Differences

The differences-in-differences (DiD) estimator can be understood as a contingency table, where the row marginals are computed as the difference between row cells, and the column marginals as the difference between column cells. Consider the following two-way table, which examines the effect of a policy intervention on test scores, with the subgroups being treated vs. control schools and pre- vs. post-intervention periods. The row marginals capture the pre- / post- variation in outcomes; the column marginals capture the treatment / control variation in outcomes.

Difference-in-Differences (DiD) Analysis

| Group                          | Pre-Intervention | Post-Intervention | Difference (Post - Pre) |
|--------------------------------|------------------|-------------------|-------------------------|
| Treated Schools                | 70               | 85                | +15                     |
| Control Schools                | 68               | 70                | +2                      |
| Difference (Treated - Control) | +2               | +15               | +13                     |

Calculating the DiD Estimator:

DiD = (Post - Pre for Treated) - (Post - Pre for Control)

DiD = (85 - 70) - (70 - 68) = 13

This table demonstrates how the difference-in-differences approach isolates the impact of the policy by controlling for baseline differences between the treated and control groups.

Most of the time, you will estimate differences-in-differences via a linear regression model:

\(\text{y}_{it} = \alpha + \beta \text{PrePost}_{t} + \delta \text{Treated}_{i} + \gamma \left( \text{PrePost}_{t} \times \text{Treated}_{i} \right) + \epsilon_{it}\)

where \(\text{PrePost}_{t}\) is a dummy that equals 0 in the pre-intervention period and 1 in the post-intervention period, \(\text{Treated}_{i}\) is a dummy that equals 0 for the control group and 1 for the treatment group, and \(\text{PrePost}_{t} \times \text{Treated}_{i}\) is the interaction between the two dummies. The table that I presented above should help you better appreciate why the difference-in-differences estimate is measured by the \(\gamma\) coefficient.

Another useful feature of the contingency table perspective on the difference-in-differences estimator is that one can construct such a table from basic aggregate data using a simple spreadsheet. For example, assume that we obtain motor vehicle accident rates for the 4 Western Canadian provinces after some hypothetical changes to road safety measures in SK and MB after 2021. The data are as follows:

Province, Year, Accident Rate (per 100K), Treated, Post-Treatment
BC, 2019, 245, 0, 0
BC, 2020, 240, 0, 0
BC, 2021, 235, 0, 1
BC, 2022, 245, 0, 1
AB, 2019, 265, 0, 0
AB, 2020, 260, 0, 0
AB, 2021, 255, 0, 1
AB, 2022, 260, 0, 1
SK, 2019, 255, 1, 0
SK, 2020, 250, 1, 0
SK, 2021, 225, 1, 1
SK, 2022, 205, 1, 1
MB, 2019, 225, 1, 0
MB, 2020, 220, 1, 0
MB, 2021, 190, 1, 1
MB, 2022, 180, 1, 1

These data can be copied into Sheets, and the differences-in-differences estimator computed with a pivot table. To do so, follow these steps:

Exercise: Creating a Pivot Table in Google Sheets to Compute the DiD Estimator
# Step 1: Select the Data
1. Open Google Sheets and paste the data above.
2. Highlight the data & apply Data → split text to columns
3. Highlight all columns, including headers.

# Step 2: Insert a Pivot Table
1. Click Insert → Pivot Table.
2. Choose New Sheet → Click Create.

# Step 3: Configure the Pivot Table
1. Rows: Add `"Treated"`; uncheck Show Totals
2. Columns: Add `"Post-Treatment"`;uncheck Show Totals
3. Values: Add `"Accident Rate (per 100K)"`, change to AVERAGE.

# Step 4: Compute the DiD Estimator
From the pivot table:

DiD = (Y2 - Y1) - (X2 - X1)

Where:
- Y1: Avg. accident rate for Treated (Pre-2021).
- Y2: Avg. accident rate for Treated (Post-2021).
- X1: Avg. accident rate for Control (Pre-2021).
- X2: Avg. accident rate for Control (Post-2021).
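If you want to check the spreadsheet result, the regression version of the estimator, fit to the accident data above, recovers the same number. The data frame below simply transcribes the table; lm() is base R:

```r
# DiD via regression on the accident-rate data
accidents <- data.frame(
  rate    = c(245, 240, 235, 245,   # BC 2019-2022
              265, 260, 255, 260,   # AB 2019-2022
              255, 250, 225, 205,   # SK 2019-2022
              225, 220, 190, 180),  # MB 2019-2022
  treated = rep(c(0, 0, 1, 1), each  = 4),  # SK and MB are treated
  post    = rep(c(0, 0, 1, 1), times = 4)   # 2021 and 2022 are post-treatment
)
fit <- lm(rate ~ post * treated, data = accidents)
coef(fit)["post:treated"]  # the DiD estimate: -33.75
```

The estimate is negative because the hypothetical road-safety measures lower accident rates: roughly 34 fewer accidents per 100K relative to the control provinces.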

Publication-Ready Tables in R and Stata

Creating publication-ready tables is time-consuming. And while Stata’s tab and R’s tabyl() commands are great for quickly generating simple tables for data cleaning and exploration, they lack many built-in features needed for polished outputs, such as titles, footnotes, and annotations. Both Stata and R offer alternative tools for producing high-quality tables that can be easily exported to a variety of formats.

Stata’s table Command for Flexible Tables

I show two Stata utilities for exporting tables. The first is asdoc, an ado-file (a user-written program, akin to an R library); download asdoc by typing ssc install asdoc. The asdoc wrapper offers a straightforward way to export Stata tables to Word and rich-text documents. Run the command below (changing the directory to fit your machine), and inspect the resulting table. It’s pretty polished, but you will still need to amend and add some elements.

//Load Secret Ballot data set

clear
use "Secret Ballot.22.2.2025.dta"

//asdoc: a simple way to export tables to word, rich-text documents

asdoc tab TwoParty ProAnti_pre1868 if first_rec==1, row save(Stata_Table_Example.doc)

When used in conjunction with the collect suite of commands, Stata’s table command offers a highly customizable way to produce publication-ready tables. Here is the code that created Table 5.1 above.

//Load Secret Ballot data set

clear
use "Secret Ballot.22.2.2025.dta"

//collect: a suite of commands to customize and export tables.

collect clear

collect create cont_table

// Run this in a do-file, not via the command line

table TwoParty ProAnti_pre1868 if first_rec==1 & ProAnti_pre1868~=0, ///
    stat(frequency) ///
    statistic(percent, across(ProAnti_pre1868) ) ///  if you want column or total %, adjust this 
    totals(ProAnti_pre1868 TwoParty) ///
    style(table-tab2)

//formatting cells
collect style cell result[freq], sformat("%s")  
collect style cell result[percent], nformat(%4.1f) sformat("(%s)") 

collect title "Table 5.1: MPs' Positions on the Adoption of the Secret Ballot in the UK House of Commons, 1834-1868."

collect notes 1: "Figures in parentheses are row percentages."
collect notes 2: "Notes: MPs' positions reflect their recorded votes on 22 motions on the secret ballot put before the House of Commons between 1834-1870. MPs who voted at least twice for (against) the secret ballot are recorded as pro- (anti). The 696 MPs who served in the House when these motions were presented, but who did not cast a vote on these motions, are omitted from the table. Party affiliation is assigned on the basis of Rallings and Thrasher (2007)."

//Tidy appearance
collect style cell cell_type[item column-header], halign(center)
collect style cell border_block, border(right, pattern(nil))

collect preview


//can export to word, excel, LateX, pdf, html, markdown; change path as necessary

collect export "Contingency.html", as(html) replace

Tables in R

Creating Tables Using dplyr’s tibble()

The dplyr package’s tibble() is another useful tool when organizing summary statistics in a tabular form. Tables of summary statistics do not always appear in the main text of a paper, but they are essential elements of a comprehensive appendix.

library(haven)
library(dplyr)

Ballot <- read_dta("Secret Ballot.22.2.2025.dta")

Ballot <- Ballot %>% 
  mutate(cast_1868 = na_if(cast_1868, -99))

summary_table <- tibble(
    Mean = mean(Ballot$cast_1868, na.rm = TRUE),
    SD = sd(Ballot$cast_1868, na.rm = TRUE),
    Min = min(Ballot$cast_1868, na.rm = TRUE),
    Max = max(Ballot$cast_1868, na.rm = TRUE),
   N = sum(!is.na(Ballot$cast_1868))
)

print(summary_table)

I find that tibble is useful mainly for diagnostic purposes. If I have to generate publication-ready tables in R, I use the flextable library:

# Load required libraries & data
library(haven)
library(dplyr)
library(flextable)
Ballot <- read_dta("Secret Ballot.22.2.2025.dta")

# Set -99 values to NA
Ballot <- Ballot %>%
    mutate(across(c(cast_1868, LF_1868), ~na_if(., -99)))

# Calculate summary statistics for two variables
summary_table <- Ballot %>%
    summarise(
        Variable = c("cast_1868", "LF_1868"),
        Mean = c(mean(cast_1868, na.rm = TRUE), mean(LF_1868, na.rm = TRUE)),
        SD = c(sd(cast_1868, na.rm = TRUE), sd(LF_1868, na.rm = TRUE)),
        Min = c(min(cast_1868, na.rm = TRUE), min(LF_1868, na.rm = TRUE)),
        Max = c(max(cast_1868, na.rm = TRUE), max(LF_1868, na.rm = TRUE)),
        N = c(sum(!is.na(Ballot$cast_1868)), sum(!is.na(Ballot$LF_1868)))
    )
#You could just print the above output--but it's not compact
 print(summary_table)
 
# Create a publication-ready flextable
ft_example <- flextable(summary_table) %>%
    set_header_labels(
        Variable = "Variable",
        Mean = "Mean",
        SD = "Standard Deviation",
        Min = "Minimum",
        Max = "Maximum",
        N = "N"
    ) %>%
    theme_vanilla() %>%
    autofit() %>%
    bold(part = "header") %>%
    align(align = "center", part = "all")

print(ft_example)

# use the officer library to export to Word etc.
library(officer)
# Export the flextable to a Word document
doc <- read_docx() %>%
    body_add_flextable(ft_example) 

# This generates a word document that's on par with stata's asdoc

print(doc, target = "summary_table.docx")

Another promising library for descriptive tables in R is table1. Read about it on this blog until I have a chance to craft a political science example.

# Load required libraries & data
library(haven)
library(dplyr)
library(table1)
Ballot <- read_dta("Secret Ballot.22.2.2025.dta")

# Set -99 values to NA
Ballot <- Ballot %>%
    mutate(across(c(cast_1868, LF_1868), ~na_if(., -99)))

table1(~ cast_1868 + LF_1868, data = Ballot)

Summary

Tables are basic diagnostic devices in the data management process. One-way tables show outliers and missingness; two-way tables show contradictions and missingness patterns. Tables are also fundamental tools for describing your data, but they do not explain themselves. Even well-designed tables, accompanied by informative titles and notes, must be explained and interpreted. Learning how to craft and export publication-ready tables will smooth your workflow and greatly improve your productivity.




Chapter 6: Univariate Graphics

Introduction

This chapter focuses on the statistical description of a single variable using graphical methods. We will explore:

  • The rationale for using graphics instead of tables.

  • Different types of univariate graphics: bar charts, dot plots, box plots, histograms, kernel density estimates, and quantile plots.

  • Application of these methods to an example data set to provide context and insights for data projects.

Why Use Graphics Instead of Tables?

Tables can become ineffective in certain scenarios:

  • When the number of categories (N) exceeds 5-9. At some point, the number of rows or columns in the table overwhelms a person’s capacity to absorb information. Miller’s maxim (1956) suggests that this number is about 7, plus or minus 2.

  • When the variable of interest is continuous.

  • When the underlying data feature a lot of outliers or gaps. Summary tables of means, even when accompanied by standard deviations, can miss these aspects of the data.

Graphics, on the other hand, can convey essential moments of a variable’s distribution at a glance:

  1. First Moment: Expected value or mean of the variable.

  2. Second Moment: The variance (hence standard deviation) of the variable.

  3. Third Moment: Skewness (asymmetry of the distribution).

  4. Fourth Moment: Kurtosis (thickness of tails).

Skewness and Kurtosis

In quantitative political science, we work with means and variances all the time, and I expect that you have a firm grasp on what those two statistics mean and how they are computed. By contrast, we do not work as frequently with skewness and kurtosis, and so it may help to review what those two statistics measure.

  • Skewness refers to asymmetry in the probability of extreme outliers; it can lead to discrepancies between the mean and median. A distribution is skewed in the direction of the long tail. Hence, relative to a symmetric distribution, a left-skewed distribution has a greater likelihood of generating values well below the distribution’s mean. A right-skewed distribution has a greater likelihood of generating values well above the distribution’s mean.

  • Kurtosis determines the probability of outliers. A normal distribution has an excess kurtosis of 0 (its raw kurtosis is 3; kurtosis is conventionally reported net of this normal benchmark). A leptokurtic distribution has positive excess kurtosis (i.e., > 0), implying that it has thicker tails than a normal distribution, and hence a higher likelihood of producing extreme values. By contrast, a platykurtic distribution has negative excess kurtosis (i.e., < 0), implying that it has thinner tails than a normal distribution, and hence a lower likelihood of producing extreme values.
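These definitions can be checked by computing the standardized moments directly in R. The sketch below uses the population formulas (no small-sample correction); the moments package offers equivalent functions.

```r
# Sample skewness: third standardized moment
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Excess kurtosis: fourth standardized moment minus 3 (the normal benchmark)
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

set.seed(381)
sym   <- rnorm(10000)   # symmetric with normal tails: both statistics near 0
right <- rexp(10000)    # right-skewed: skewness clearly positive (about 2)
skewness(sym)
excess_kurtosis(sym)
skewness(right)
```

Running the functions on simulated normal and exponential draws, as above, is a quick way to build intuition for what "large" skewness or kurtosis looks like.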

Broad Principles of Graphics

When creating graphics, our goal is to provide an efficient summary of the underlying data without introducing distortion or unnecessary distractions. A good graphic should highlight key insights, such as central tendencies, variability, and any notable patterns, while avoiding excessive visual elements that detract from interpretation.

The Information-to-Ink Ratio Explained

The information-to-ink ratio, a concept introduced by Edward Tufte, measures how efficiently a graphic uses ink to convey meaningful information. The point is to convey as much information as possible with as little “ink” (i.e., graphical detail) as possible. The principle emphasizes minimizing unnecessary visual elements while maximizing the presentation of key insights into the data.

A high information-to-ink ratio:

  1. Enhances Readability: Simplified visuals allow readers to easily focus on important patterns and trends.

  2. Clarifies and Eases Communication: Reduces cognitive load and eliminates distractions, allowing viewers to grasp key insights quickly.

  3. Improves Comparisons: Clean visuals highlight differences and similarities among data points, enabling effective comparisons.

For example, bar charts often have a poor information-to-ink ratio due to the use of filled bars that add visual clutter without adding any additional information. Alternatives, like dot plots (see below), improve this ratio by reducing excess ink and providing a cleaner visual representation.

Understanding Occlusion and Avoiding It

Occlusion occurs when parts of a graphic obscure or hide underlying information, making it harder for viewers to interpret key data accurately. This often happens when overlapping elements or thick, solid visual features block important details.

To avoid occlusion:

  • Use transparency or lighter shading to allow layers to remain visible.

  • Opt for side-by-side comparisons when overlaying multiple data sets would create too much clutter.

  • Select minimalist visual designs (e.g., lines, dots) to prevent one part of the graphic from overwhelming another.

By adhering to these principles, students can create visuals that are both informative and aesthetically efficient.

Consistent and Comprehensive Labeling

Graphs should include comprehensive labels to help viewers interpret the data correctly:

  • Informative Title: Every graph should have a clear and descriptive title summarizing the key insight it provides.

  • Labeled Axes: Both axes must be labeled to indicate what is being graphed and the units of measurement used.

  • Descriptive Notes: A brief note should explain how to interpret a typical data point in direct, descriptive terms. For example, instead of theoretical insights, it should convey what a specific value represents in terms of the variable being measured.

Effective labeling enhances understanding and ensures that the graphic can stand on its own without requiring extensive external explanation.

Visualizing Data Effectively

When creating graphics, the primary goal is to provide a clear and efficient summary of the underlying data without introducing distortion or unnecessary distractions. A well-designed graphic should emphasize key insights, such as central tendencies, variability, and notable patterns, while avoiding excessive visual elements that might hinder interpretation. By adhering to these principles, we can more effectively evaluate how graphical methods apply to specific data sets.

To illustrate these concepts, I use the data set from Kostelka and Blais (2021) on electoral turnout. The first part of this chapter demonstrates a range of univariate graphics, focusing on the sound application of graphical principles. These principles include maximizing the information-to-ink ratio, avoiding occlusion through minimalist design, and adding informative titles and labels. I also discuss which types of graphs are best suited to different kinds of data.

The latter half of the chapter provides the code that I use to generate the graphics in the chapter. I present both the Stata and R code needed to replicate my graphics. I am especially impressed with the structured syntax of R’s ggplot2; it provides an intuitive framework for generating insightful graphics.

Bar Charts as a Starting Point

Bar charts are a familiar graphical representation using the height of bars to indicate quantities of interest. They are often used to visualize categorical data, but they have two main weaknesses:

  1. Sub-optimal Information-to-Ink Ratio: Bar charts often waste ink by shading or filling bars, which can obscure the clarity of the data without adding value.

  2. Limited Display of Within-Category Variation: Bar charts primarily represent aggregate data, making it difficult to see variations within each category.

Practical Example: Four Bar Charts of Electoral Turnout: I present four bar charts to illustrate the evolution of my approach to effective data visualization.

The first chart is a disaster! Not only does the excessive shading add no informational value, but the categories are also ordered alphabetically by country—an arrangement with no logical connection to the data being represented. As a result, the country labels are a mess, overlapping and rendering the text nearly illegible.

The second attempt, displayed at the bottom left of the panel, is an improvement. This time, I order the countries by electoral turnout, with turnout increasing from left to right. This ordering introduces a logical flow, inviting the viewer to explore potential patterns—such as whether countries with low turnout share certain characteristics and how they differ systematically from those with high turnout. However, this version still suffers from occluded labels, non-informative shading, and the absence of a title.

The third chart, in the top-right corner, is far more polished. I add a clear and informative title, reduce the intensity of the bar shading to enhance the information-to-ink ratio, and angle and shrink the labels to eliminate overlap. I also include a source note to provide proper context for the data.

The fourth and final chart, at the bottom right, adopts a horizontal layout. This orientation significantly improves the readability of the country labels. Additionally, I narrow the width of the bars, as their thickness does not convey any extra information.

Even this final version is not perfect. Notice how most of the chart, specifically the large area below 40 percent turnout, contributes little to the viewer’s understanding. This empty space consumes visual real estate and distracts from the key insights without offering meaningful information about the average level of turnout across countries or the variance between them. There are better alternatives to explore.

Dot Plots as an Alternative to the Bar Chart

Dot plots are good alternatives to bar charts. The advantages of the dot plot are twofold:

  1. Dot plots eliminate the bars and use dots to indicate quantities, thereby improving the information-to-ink ratio.

  2. Dot plots can better highlight cross-case variation and provide a cleaner visual representation.
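Both advantages can be seen in a minimal ggplot2 sketch. The turnout figures below are invented for illustration, not taken from the Kostelka and Blais data.

```r
library(ggplot2)

# Hypothetical mean-turnout figures, for illustration only
turnout <- data.frame(
  country = c("Switzerland", "Canada", "United Kingdom", "Sweden", "Australia"),
  mean_turnout = c(48, 67, 73, 84, 94)
)

# reorder() ranks countries by turnout; geom_point() replaces the bars,
# raising the information-to-ink ratio
p <- ggplot(turnout, aes(x = mean_turnout, y = reorder(country, mean_turnout))) +
  geom_point(size = 2) +
  labs(x = "Mean Turnout (%)", y = NULL,
       title = "Average turnout at legislative elections (illustrative data)")
print(p)
```

With real data, the same pattern applies: compute the country means first (e.g., with dplyr's group_by() and summarise()) and pass the resulting data frame to ggplot().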

Practical Example: Dot Plots Instead of Bar Charts: The next three graphs show exactly the same data as above, but in the form of dot plots rather than bar graphs. Again, a series is shown to illustrate how I refine the plots on the basis of graphical principles.

The first plot is already an improvement over the bar chart: removing the bars removes a lot of uninformative ink without subtracting any information. Nonetheless, the plot is too rough and unfinished to present in the final draft of any work.

The second dot plot incorporates several key refinements. I add a clear and informative title, rank the countries in descending order of turnout, and adjust the y-axis scale to use 10 percent intervals—an approach I believe is necessary to highlight differences among countries. I also reduce the y-axis grid lines to near invisibility by minimizing their spacing and intensity, though I may have gone a bit too far in my effort to reduce unnecessary ink. To draw the reader’s attention, I shade the dots, enhancing their visual impact.

However, despite these improvements, a significant issue remains: a large expanse of empty, uninformative space persists on the left side of the graph, particularly below the 40 percent turnout mark. This area adds no value and distracts from the key insights the graph aims to convey.

The final plot in this series introduces several important enhancements to the previous version:

  1. Adjusting the Y-Axis Scale: The most significant change is modifying the y-axis to start at 40 percent. This eliminates the large, unused space on the left, effectively addressing a major shortcoming of the earlier plot.

  2. Scaling Back Ink Reduction: I increase the intensity of the y-axis grid lines, reversing my overly enthusiastic elimination of ink in the previous iteration. This restores a better balance between minimalism and functionality, making it easier for the viewer to link countries with their respective turnout levels.

  3. Tweaking the Aspect Ratio: I subtly reduce the plot’s aspect ratio from 1.0 to 0.60. The aspect ratio refers to the height-to-width ratio of the graph area. This change stretches the plot horizontally, highlighting three distinct groupings of countries: those with very low turnout, those with moderate turnout, and those with high turnout.

  4. Adding Reference Lines: I include two reference lines, using a color-blind-friendly palette, to mark the boundaries between the three groups.

This final version achieves two key objectives:

  • Accentuating Patterns: The dot plot reveals turnout groupings far more effectively than the previous bar chart, emphasizing how visualization choices influence data interpretation.

  • Supporting a Statistical Narrative: By presenting the data in this manner, the plot not only displays information but also encourages a specific interpretation, linking the visualization to a broader statistical story.

Box Plots to Show Within-Case Distribution

One obvious weakness of the dot plots that I just presented is that they do not show information on within-case variation. Two countries might have the same mean level of turnout, but turnout might fluctuate widely around that mean in one country and hardly at all in the other. Box plots provide a way to visualize the distribution within categories:

  • The “box” represents the interquartile range (IQR), where the middle 50% of observations lie.

  • The “whiskers” can be customized to represent different ranges (e.g., 95% confidence intervals).

Box plots effectively show within-case variation and highlight potential outliers.
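To see why within-case variation matters, consider a small simulated sketch (hypothetical data) in which two "countries" share the same mean turnout but differ sharply in spread:

```r
library(ggplot2)
set.seed(381)

# Two hypothetical countries: identical mean turnout (~75%), very different spread
dat <- data.frame(
  country = rep(c("Country A", "Country B"), each = 20),
  turnout = c(rnorm(20, mean = 75, sd = 2), rnorm(20, mean = 75, sd = 10))
)

# The box spans the IQR, the interior line marks the median,
# and points beyond the whiskers flag potential outliers
p <- ggplot(dat, aes(x = turnout, y = country)) +
  geom_boxplot() +
  labs(x = "Turnout (%)", y = NULL)
print(p)
```

A dot plot of the two means would look identical; the box plot immediately reveals that Country B's turnout fluctuates far more widely.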

Practical Example: Using a Box Plot to Highlight Within-Country Variance

The example below presents the mean turnout data using a horizontal box plot. (A vertical orientation typically works well for time-series data, where time is displayed on the x-axis.)

The within-country variation shown in the box plot challenges the threefold grouping suggested by the previous dot plot. I retained the reference lines from the earlier plot to highlight how the dot plot created the impression of distinct boundaries between groups.

Histograms as Another Familiar Option

Histograms offer a superior method to tables for visualizing the frequency distribution of continuous variables. They rely on binning, where data is grouped into intervals, introducing two key trade-offs:

Trade-offs in histograms

1. Equal Bins vs. Equal Widths

Should we use:

  • Equal Bins: Create \(K\) bins with an equal share of observations (\(N/K\)) but varying widths?

  • Equal Widths: Set fixed-width bins (the variable’s range divided into \(K\) equal intervals) but with potentially uneven observation counts?

Example:

Consider 100 observations of a variable \(x\) ranging from 0 to 100.

  • Equal Bins Strategy: Order observations and allocate the lowest 10 to the first bin, the next 10 to the second, and so on. However, if \(x\) is highly right-skewed, with only 10 observations where \(x > 50\), the 10th bin might span from 50 to 100—a width of 50.

  • Equal Widths Strategy: Divide \(x\) into 10 equal-width bins (0-10, 11-20, … 91-100). Yet, with only 10 observations above 50, the upper bins might hold just 1-2 observations—or be empty.

2. Number of Bins

  • More Bins: Provide a detailed view of the variable’s probability distribution function (PDF) but risk distortion if many bins have sparse observations.

  • Fewer Bins: Increase robustness by ensuring each bin holds many observations but reduce informativeness as \(K \rightarrow 1\).

There is no definitive answer to these trade-offs—it’s a matter of judgment in light of the data at hand. The key is to make these choices consciously and explicitly based on the data. Otherwise, your software will make them for you—implicitly and without consideration of the data’s nuances!
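The two strategies can be sketched in base R on synthetic right-skewed data (the numbers are illustrative, not drawn from the chapter's data set):

```r
set.seed(381)
# 100 right-skewed observations on [0, 100]
x <- pmin(rexp(100, rate = 1/20), 100)

# Equal widths: 10 bins of width 10; the upper bins may hold few observations
equal_width <- cut(x, breaks = seq(0, 100, by = 10), include.lowest = TRUE)
table(equal_width)

# Equal bins: deciles as break points, so each bin holds ~10 observations,
# but the upper bins become very wide
equal_count <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
                   include.lowest = TRUE)
table(equal_count)
```

Comparing the two frequency tables makes the trade-off concrete: fixed widths leave the upper bins sparse, while fixed counts stretch the upper bins over a wide range of values.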

Practical Example: Number of Bins and Informativeness

The four histograms below show how changing the number of bins alters the amount of information that is conveyed. If too few bins are used (bottom-left), the histogram contains little information. Employ too many bins (bottom-right), however, and the histogram is cluttered with detail. Hence, we want to use only as many bins as are necessary to summarize the data effectively. That “sweet spot” number varies depending on the data. You must use judgement.

Histograms suffer from many of the same problems that afflict bar charts:

  • Like bar charts, histograms suffer from occlusion (i.e., they may hide or obscure information).

  • Because the bars are solid, we cannot layer one histogram over another to compare distributions directly. Even if we use wire-frame bars, we end up with so much occlusion that the graphic is indecipherable. Thus, we’re forced to compare on a side-by-side basis. This is sub-optimal.

  • Imposing an equal number of bins on cases does not necessarily improve side-by-side comparison because a number of bins (or bin width) that works for one case may work poorly for another.

Practical Example: The Bin Dilemma and Occlusion Undercut Comparison

This example brings all these problems to a head: Can you tell if New Zealand’s turnout distribution is fundamentally different than the United Kingdom’s? Or if the United Kingdom’s is fundamentally different than Sweden’s?

Kernel Density Estimation

Kernel density estimation (KDE) provides a smoothed representation of the distribution, addressing some limitations of histograms:

The smoothing helps to ease (but not eliminate) the trade-offs implicit in the histogram’s binning procedure.

The basic idea is as follows:

  1. Construct a window centered around a fixed observation, \(x_0\). This window can be defined by a set “bandwidth” or specified to contain the \(m\) “nearest neighbours” of \(x_0\).

  2. Compute a weighted average of the \(x_i\) in the window.

  3. Shift the window over by one observation, so that the windows overlap.

  • KDE computes a weighted average within a moving window around each observation. The kernel choice determines the weights assigned to observations in the window.

  • Bandwidth selection is crucial: too small a bandwidth leads to noisy estimates, but too large a bandwidth oversmooths the data.

Understanding the Mechanics of KDE

Common kernels used in KDE include the uniform or rectangular kernel, which weights all observations within the window equally and assigns zero weight to observations outside the window. Other kernels may give higher weight to observations closer to the center. For example, a triangular kernel places maximum weight on the observation(s) at the very center of the window; the weights then decline symmetrically as one moves to the edges of the moving window.
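The role of the kernel can be explored directly with R's density() function, which implements several of these kernels. The sketch below uses synthetic bimodal data:

```r
set.seed(381)
# Synthetic bimodal "turnout" data: a large mode near 60, a smaller one near 85
x <- c(rnorm(150, mean = 60, sd = 4), rnorm(50, mean = 85, sd = 3))

# Same bandwidth, different weighting schemes within the window
d_rect <- density(x, kernel = "rectangular", bw = 2)  # equal weights in window
d_tri  <- density(x, kernel = "triangular",  bw = 2)  # weights decline toward edges

plot(d_rect, main = "KDE: rectangular vs. triangular kernel")
lines(d_tri, lty = 2)
```

With the bandwidth held fixed, the two curves differ only modestly, which previews a point made below: the choice of bandwidth usually matters more than the choice of kernel.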

Advantages of KDE Relative to Histograms

  • At least in the middle of the distribution, KDE reduces concerns about the influence of histogram bins with only 1-2 idiosyncratic observations. This is because the overlapping windows and weighted averaging help to smooth out local irregularities.

  • One can overlay multiple KDEs for comparison without occlusion. This is not possible with histograms.

Practical Example: Overlaid KDEs Facilitate Comparison

This example demonstrates how we can overlay multiple KDEs with minimal occlusion. Comparing distributions in this manner is far easier and more effective than the side-by-side comparisons required by histograms. The overlay reveals a clear 10-point decrease in electoral turnout during the 1991-2017 period.

Trade-offs in KDE

But there is no free lunch:

  • At the ends of the distribution, the window is truncated, hence outliers at the ends of the distribution can distort our estimates.

  • We still face a signal-to-noise problem:

    • If bandwidth is too small, or too few nearest neighbors are included in the moving window, local averages are noisy (i.e., high variance).

    • If bandwidth is too big, or too many nearest neighbors are included in the moving window, local averages become little different from the sample average.

Software will have a default that “optimizes” this choice, but Stata’s manual warns that its default bandwidth may be too wide for multi-modal or skewed distributions. R’s documentation is disturbingly silent on these issues.

In general, however, bandwidth selection is more important than the kernel choice.
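The bandwidth trade-off is easy to demonstrate with density(), whose bw argument overrides R's default rule of thumb (bw.nrd0). The data below are synthetic:

```r
set.seed(381)
# Synthetic bimodal data: a large mode near 60, a smaller one near 85
x <- c(rnorm(150, mean = 60, sd = 4), rnorm(50, mean = 85, sd = 3))

d_default <- density(x)             # R's default bandwidth, bw.nrd0(x)
d_small   <- density(x, bw = 0.5)   # likely too narrow: noisy, spurious bumps
d_big     <- density(x, bw = 10)    # likely too wide: the two modes blur together

plot(d_small, main = "KDE under three bandwidths", lty = 3)
lines(d_default, lwd = 2)
lines(d_big, lty = 2)
```

Overlaying the three estimates shows the signal-to-noise problem at a glance: the narrow bandwidth chases noise, while the wide bandwidth erases the second mode entirely.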

Quantile Plots

Quantile plots provide a comprehensive visualization of a variable’s distribution by mapping ordered values to their corresponding cumulative probabilities. They are particularly useful for identifying patterns in the data, detecting irregularities like gaps, and assessing distributional characteristics without requiring arbitrary choices like bin widths in histograms or kernel selection in density estimation.

Conceptual Construction of a Quantile Plot

A quantile is a synonym for a percentile and represents a value below which a given fraction of observations fall. Constructing a quantile plot involves three conceptual steps:

  1. Arranging the Data: The data is sorted from the minimum to the maximum observed value.

  2. Calculating Quantiles: For each data point, the fraction of observations that are less than or equal to it is determined. This fraction represents the cumulative probability, and the data value itself is the quantile.

  3. Plotting the Data: The quantiles are plotted on the vertical axis against their corresponding cumulative probabilities on the horizontal axis.

As Cleveland (1993, 17-18) defines it:

“The f quantile, q(f), of a set of data is a value along the measurement scale of the data with the property that approximately a fraction f of the data are less than or equal to q(f).”

Informally, you can think of a quantile plot as the inverse of a cumulative distribution function (CDF). A CDF shows the probability that a realization of a random variable will be less than or equal to a given value. For example, when rolling a fair die, the probability of obtaining a result of 2 or less is \(2/6\), since rolling a 1 has a probability of \(1/6\) and rolling a 2 also has a probability of \(1/6\). This means that if we roll the die many times, we would expect approximately one-third of the outcomes to be either 1 or 2. By contrast, a quantile plot reveals the value of the random variable at or below which a given fraction of the data falls. To continue the example, a quantile plot of a series of die rolls would indicate that the value at or below which one-third of the outcomes lie is 2.

To some extent, the difference between a CDF plot and a quantile plot is cosmetic: in a CDF plot, the x-axis shows values of the variable, and the y-axis shows the associated cumulative probabilities. In a quantile plot, these axes are reversed—the x-axis shows cumulative probabilities, while the y-axis shows values of the variable.
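The three conceptual steps can be carried out by hand in base R. The sketch below uses synthetic turnout values; the \((i - 0.5)/n\) formula for the cumulative probabilities follows Cleveland's convention.

```r
set.seed(381)
turnout <- rnorm(50, mean = 75, sd = 8)   # synthetic turnout values

q <- sort(turnout)                        # step 1: order the data
f <- (seq_along(q) - 0.5) / length(q)     # step 2: cumulative probabilities (f-values)

# step 3: plot quantiles (vertical axis) against probabilities (horizontal axis)
plot(f, q, xlab = "Cumulative probability f", ylab = "Turnout (%)",
     main = "Quantile plot")

# Cross-check: the value plotted at f = 0.5 should sit near the sample median
median(turnout)
```

Swapping the axes in the plot() call would produce the corresponding CDF-style view of the same data.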

Advantages of Quantile Plots

A quantile plot provides an intuitive view of a variable’s distribution by mapping how observed values are spread across the range of observations. Quantile plots have a number of distinct strengths.

  • Exceptional Gap Detection: Unlike histograms and KDE, which can obscure sparsity in certain ranges due to smoothing or binning, quantile plots provide a direct and unobstructed view of where data is missing or unevenly distributed.

  • Scalability Across Sample Sizes: Quantile plots remain effective regardless of data set size, providing meaningful insights even when the number of observations is small.

  • Minimal Arbitrary Choices: Unlike histograms that require bin width selection or KDE that depends on smoothing parameters, quantile plots use the raw data structure without additional assumptions, preserving interpretability.

  • Versatility in Data Types: They can be used with ordinal or binary variables, although their application to nominal variables depends on numerical coding.

By making data gaps visible and avoiding the pitfalls of traditional density estimation methods, quantile plots serve as an invaluable tool in exploratory data analysis. Their flexibility and ability to reveal subtle distributional features make them particularly useful for assessing data quality before formal modeling.

Practical Example: Using Quantile Plots to Compare Turnout Distributions Over Time

A quantile plot provides a powerful way to compare distributions by showing how values are distributed across different quantiles. Here’s how to read it using this example and the insights it reveals:

How to Read a Quantile Plot

Recall that a quantile plot arranges data points along the x-axis, representing cumulative probabilities from 0 to 1, while the y-axis displays the corresponding data values—in this case, electoral turnout percentages. By focusing on a specific quantile on the x-axis, we can quickly determine what proportion of the data falls above or below a particular turnout threshold.

For example, in the 1945-1969 and 1970-1990 periods, tracing horizontally from 80 percent turnout to the curve, and then down to the x-axis, lands at a cumulative probability of 0.5. This indicates that 80 percent turnout was the median, with half of all elections surpassing this level.

In contrast, in the post-1990 era, the same exercise lands at 0.75 on the x-axis, implying that an 80 percent turnout in that era sat at the 75th percentile of the distribution. This underscores how high turnout has become increasingly rare over time: only the top 25 percent of elections in the 1991-2017 period reached or exceeded this level.

This example highlights how quantile plots enable effective comparisons even within the side-by-side format that hamstrung histograms. The three quantile plots also show two other key features of the data:

  1. High Turnout Becomes Rare: The upper tail of the distribution (i.e., high turnout elections) is longer and thinner in the post-1990 era, showing that high turnout elections are now exceptional events.

  2. Turnout Gap Emergence: During the 1970-1990 period, there is a clear gap in the data—elections with turnout in the low-to-mid 60 percent range are nearly absent. This gap persists in the post-1990 era, but shifts lower, signaling a broader decline in turnout levels.

In the next chapter, we’ll explore an adaptation of the quantile plot that addresses the limitations of side-by-side comparisons and further reduces occlusion.

Enhancing Quantile Plots

One can enhance a quantile plot by adding a histogram to the plot’s margin. This addition means the graph informs the reader about both the variable’s cumulative distribution (the quantile plot) and its probability distribution (the histogram). You can find an article on this technique and its value in this piece in the Stata Journal.

The example below is not perfect: there’s a touch of occlusion between the histogram and the quantile markers at 0, and the technique would pair better with a more finely-grained variable. Indeed, the exercise below asks you to replicate the plot using a dense, continuous variable.

You can find the Stata code to generate this graphic below–and it is worth reading that code to gain a firmer grasp of how a histogram is constructed and what aspects of the data it illustrates.

Exercise: Generating an Enhanced Quantile Plot

Adapt the code below to make a similar quantile plot for a dense, continuous variable, e.g., Turnout in the Kostelka & Blais data set.

The challenge is to adapt this command -- gen nprob = -((prob))/.8 -- so that the histogram spikes show enough detail without overwhelming the quantile plot. This is done by trial and error.

Appendix: Graphics in Stata and R

I provide the Stata and R code to replicate the graphics presented above. My code will provide you with many working examples, but both pieces of software offer many more options. You can explore these options and find help remembering commands with data visualization cheat sheets.

Replication of Graphics in Stata

Pre-processing of Data and Bar Charts

If you run these chunks one after the other in a single Stata (or R) session, then you need to load and pre-process the data just once. By contrast, if you run the chunks in isolation from one another, you must add the code lines that load and pre-process the data to every chunk.

//Load data
  clear
  use "Kostelka & Blais Base.dta"

//Prepare Data
* I want only countries with long time series of 20+ legislative elections

    gsort country EL_TYPE year

*_n is nth record in a group; _N is total N records in that group

    by country EL_TYPE, sort: gen N_Elections = _N

    keep if N_Elections > 19 & EL_TYPE ==0 

*Break data into 3 "eras" for interest: 
      * post-war, pre-oil shock; 
      * oil shock to fall of Berlin Wall; 
      * post-Wall-Density

  gen era = 1 if year <1970
   replace era = 2 if year>=1970 & year <1991
   replace era = 3 if year >=1991

*label variables
  label var N_Elections "N elections observed for country"
  label var era "Historical Era"
  label define hist_era 1 "Pre-Oil Shock" 2 "Oil Shock-to-Berlin Wall" 3 "Post-Berlin Wall"
  label values era hist_era

//Bar Charts as a baseline

*Bar chart of mean turnout by country

  graph bar (mean) Turnout, over(country)
        *save file (alter path as necessary):
      graph save "Graph" "bar_graph_1.gph"

*sorting countries by mean turnout improves quick interpretation 

  graph bar (mean) Turnout, over(country, sort(Turnout))
        *save file:
        graph save "Graph" "bar_graph_2.gph"

*Tidy labels; add title, source; reduce ink intensity to increase info-to-ink ratio.
  
graph bar (mean) Turnout, over(country, sort(Turnout) descending label(angle(forty_five) labsize(vsmall))) intensity(20) lintensity(100) ytitle("Mean Turnout (%)", size(small)) title("Average turnout at legislative elections in 16 countries, 1945-2017", size(small)) note("Source: Kostelka & Blais (2021)", size(vsmall))
        *save file:
        graph save "bar_graph_3.gph"

*Horizontal orientation via hbar looks better

graph hbar (mean) Turnout, over(country, sort(Turnout) descending label(angle(forty_five) labsize(vsmall))) intensity(20) lintensity(100) ytitle("Mean Turnout (%)", size(small)) title("Average turnout at legislative elections in 16 countries, 1945-2017", size(small)) note("Source: Kostelka & Blais (2021)", size(vsmall))
        *save file:
        graph save "Graph" "bar_graph_4.gph"
        
// Create panel of bar graphs

graph combine "bar_graph_1.gph" "bar_graph_2.gph" "bar_graph_3.gph" "bar_graph_4.gph", col(2) colfirst scale(.7)

Dot plots in Stata

//Ensure data are loaded and pre-processed as in previous chunk

//A Basic Dot Plot

graph dot (mean) Turnout, over(country, sort(Turnout))

//Adding titles, adjusting y-axis scaling etc.

graph dot (mean) Turnout, over(country, sort(Turnout) descending) nofill exclude0 marker(1, mcolor(black) msize(medsmall) msymbol(circle) mfcolor(black)) ndots(50) dots(mcolor(gs14)) ytitle("Turnout (%)") ylabel(0(10)100) ymtick(##2, ticks tlength(relative0p6) tlwidth(thin)) title("Average turnout at legislative elections in 16 countries, 1945-2017", size(small)) note("Source: Kostelka & Blais (2021)", size(vsmall))

//Compare to the horizontal bar chart of the same data.
//Observe how the dot plot more clearly reveals 3 groups 
//Altering the aspect ratio accentuates grouping of data 

graph dot (mean) Turnout, over(country, sort(Turnout) descending label(labcolor("black") labsize(vsmall))) nofill exclude0 marker(1, mcolor(black) msize(medsmall) msymbol(circle) mfcolor(black)) ndots(50) dots(mcolor(gs10))  ytitle("Turnout (%)", size(small)) ylabel(40(10)100) ymtick(##2, ticks tlength(relative0p6) tlwidth(thin) labsize(vsmall)) yline(65 82, lpattern(vshortdash) lcolor(reddish)) title("Average turnout at legislative elections in 16 countries, 1945-2017", size(small) margin(medium)) note("Source: Kostelka & Blais (2021)", size(vsmall)) aspect(.6)

Box plots in Stata

This code chunk is notable for how I re-rank cases in the box plot. Stata’s box plot sorts cases by their median values by default. I use a loop to fill in each country’s mean turnout, and then sort on those means.

//Box Plots to Convey Information on Distribution

graph box Turnout, over(country, sort(Turnout) descending label(angle(forty_five) labsize(small))) box(1, fcolor(gs12%20) fintensity(20) lcolor(black) lwidth(vthin) lpattern(solid)) intensity(20) lintensity(100) medtype(cline) medline(lcolor(reddish) lwidth(medium) lpattern(solid)) ytitle("Mean Turnout (%)") title("Average turnout at legislative elections in 16 countries, 1945-2017") note("Source: Kostelka & Blais (2021)")

// Order different than bar graph because box plot sorts on median by default.
// I generate mean turnout by country via a loop:

*sort country and summarize it; I see numeric values of country labels go from 2-117
sort country year
summ country

*Generate an empty variable to hold estimates of country's mean turnout
    
    generate MeanTO = .
    
*I tell Stata: start counting from 2 to 117 in steps of 1
    *At each step, compute mean turnout for cases where country equals loop counter
    *Save the mean turnout for cases where country equals Stata's count
    *Increase the count by 1
    
    forvalues i=2(1)117 {
        quietly: summ Turnout if country == `i'
        replace MeanTO = r(mean) if country == `i'
    }

// Look at what this did

list country year Turnout MeanTO if _n<11

// Repeat box plot, sorted by MeanTO and horizontal orientation

graph hbox Turnout, over(country, sort(MeanTO) descending label(labsize(small))) box(1, fcolor(gs12%20) fintensity(20) lcolor(black) lwidth(vthin) lpattern(solid)) intensity(20) lintensity(100) medtype(cline) medline(lcolor(gs10) lwidth(medthick) lpattern(solid))  ylabel(30(10)100) yline(65 82, lpattern(vshortdash) lcolor(reddish)) ytitle("Mean Turnout (%)") title("Average turnout at legislative elections in 16 countries, 1945-2017") note("Source: Kostelka & Blais (2021)")

Histograms in Stata

//Histograms and bin size

*A Basic Histogram with Stata choosing default bin size

 histogram Turnout
 
*See effect of changing K bins

 histogram Turnout, bins(3)
 
 histogram Turnout, bins(30)

*Too much information!

 histogram Turnout, bins(300)

*No information at all!

 histogram Turnout, bins(1) by(country)
 
*Uniform bin size does not work for all cases
 
 histogram Turnout, bins(10) by(country)
  

Kernel Density Estimation in Stata

Some notes on kernel density estimation in Stata.

  1. Stata (and many other programs) use Epanechnikov kernel by default.

  2. The Stata manual states: “The width is specified as a half-width… a KDE with half-width 20 corresponds to sliding a window of size 40 across the data.”
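To see concretely what “sliding a window” means, here is a minimal Python sketch of a rectangular-kernel density estimate on simulated data. This is a side illustration only, not the course’s Stata code, and the sample values are made up:

```python
import numpy as np

def kde_rect(grid, data, halfwidth):
    """Rectangular-kernel KDE: the density at each evaluation point is the
    share of observations within +/- halfwidth, divided by the window width."""
    data = np.asarray(data, dtype=float)
    return np.array([np.mean(np.abs(data - g) <= halfwidth) / (2 * halfwidth)
                     for g in grid])

rng = np.random.default_rng(0)
sample = rng.normal(70, 10, size=500)        # stand-in for turnout percentages
grid = np.linspace(30, 110, 401)
dens = kde_rect(grid, sample, halfwidth=5)

step = grid[1] - grid[0]
print(float(dens.sum() * step))              # the density integrates to roughly 1
```

Widening the half-width admits more observations into each window and so smooths the estimate, which is exactly what the bwidth() option controls below.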

// Kernel Density Estimation (KDE) in Stata. 

* Experiment with bandwidth by altering bwidth() option

kdensity Turnout, bwidth(5)

*Changing the kernel.  
kdensity Turnout, bwidth(5) kernel(rectangle) title("Turnout at legislative elections in 16 countries, 1945-2017") note("Source: Kostelka & Blais (2021)" "Rectangular kernel with bandwidth = 5")

//Setting n of sliding windows used instead of bandwidth
*The n() option refers to the N of "sliding windows"

kdensity Turnout, kernel(rectangle) n(200)


//A key advantage of KDE is less occlusion
twoway(kdensity Turnout if era==1, bwidth(5) lwidth(medthin)) ///
      (kdensity Turnout if era==2, bwidth(5) lcolor(reddish) lwidth(medium)) ///
      (kdensity Turnout if era==3, bwidth(5) lcolor(sea) lwidth(medium) lstyle(solid) ///
      xtitle("Turnout (%)", size(small)) ytitle("Density", size(small))), ///
       legend(on order(1 "1945-1969" 2 "1970-1990" 3 "1991-2017")) ///
      title("Three Eras of Election Turnout:" "Voter turnout at legislative elections in 16 countries, 1945-2017", size(small) margin(medlarge))  ///
      note("Source: Kostelka & Blais (2021)" "Epanechnikov kernel (bandwidth = 5)", size(vsmall)) 
     

Quantile Plots in Stata

//Quantile Plots

//The basic syntax.  Experiment with the aspect ratio.
quantile Turnout, aspect(1)

//Pre-oil shock era.
quantile Turnout if era==1, aspect(1) xtitle(, size(vsmall)) ytitle(, size(vsmall)) recast(scatter) mcolor(none) msize(small) msymbol(circle) mfcolor(none) mlcolor(gs7) mlwidth(vthin)  ylabel(, labsize(vsmall))  xlabel(, labsize(vsmall)) title("1945-1969", size(vsmall))
  *saving result:
  graph save "Graph" "Quantile_plot_example_1.1.gph", replace

//Oil Shock-to-Berlin Wall Era
quantile Turnout if era==2, aspect(1) xtitle(, size(vsmall)) ytitle(, size(vsmall)) recast(scatter) mcolor(none) msize(small) msymbol(circle) mfcolor(none) mlcolor(gs7) mlwidth(vthin) ylabel(, labsize(vsmall))  xlabel(, labsize(vsmall)) title("1970-1990", size(vsmall))
 *saving result:
  graph save "Graph" "Quantile_plot_example_1.2.gph", replace

//Post-Berlin Wall Era
quantile Turnout if era==3, aspect(1) xtitle(, size(vsmall)) ytitle(, size(vsmall)) recast(scatter) mcolor(none) msize(small) msymbol(circle) mfcolor(none) mlcolor(gs7) mlwidth(vthin) ylabel(, labsize(vsmall))  xlabel(, labsize(vsmall)) title("1991-2017", size(vsmall))
   *saving result:
    graph save "Graph" "Quantile_plot_example_1.3.gph", replace

//Much could be improved: 
//Use the graph editor and [recorder](https://www.stata.com/support/faqs/graphics/graph-recorder/) to repeat improvements from one graph onto another 

graph combine "Quantile_plot_example_1.1.gph" "Quantile_plot_example_1.2.gph" "Quantile_plot_example_1.3.gph", row(1) ycommon title("Three eras of turnout at legislative elections in 16 countries, 1945-2017", size(small) margin(vsmall)) note("Source: Kostelka & Blais (2021)", size(vsmall))

graph export "Quantile_quantile_example_1.4.png", as(png) name("Graph") 

Adding a histogram or spike plot to a quantile plot


// [Adding a spike plot to margin of a q-plot](https://journals.sagepub.com/doi/pdf/10.1177/1536867X211045583)


use "Secret Ballot\Secret Ballot.22.2.2025.dta", clear

keep if last_record == 1

//counts non-missing values on variable of interest

count if NofPMBs_68 ~=. 

// prob is total sum of 1/N at each value of NofPMBs_68, where N is N of observed values of NofPMBs_68
// Thus, we compute fraction of data that obs where NofPMBs_68=0, 1, 2...22 represent. 

egen prob = total(1/`r(N)'), by(NofPMBs_68)

// Tags 1 observation of each unique value of NofPMBs_68. Need only 1 such observation
// because spikes reflect fraction of data that obs of NofPMBs_68=1, 2 represent

egen tag = tag(NofPMBs_68)

// prob(ability) x -1 changes direction of spikes; divide or use sqrt(prob) to alter spike size

gen nprob = -((prob))/.8

quantile NofPMBs_68 , rlopts(lc(none) ms(oh) lpattern(shortdash)) legend(off) addplot(spike nprob NofPMBs_68 if tag==1 , sort horizontal lwidth(medthick) color(gs8)) ymtick(##5) xlab(0(.25)1) xmtick(##5) msymbol(circle) msize(medsmall) mfcolor(gs12%40) mlcolor(gs6) mlwidth(vvthin) ytitle("N ballot motions attended", size(medsmall)) aspect(1) title("Quantile Plot of Number of Motions" "on Secret Ballot Attended by MPs, 1834-1868", size(medsmall) margin(medsmall))



Replication in R Using ggplot2

One strength of R’s ggplot2 library is its logical and uniform structure. The general structure of a ggplot2 visualization involves specifying three elements: 1) the data; 2) the aesthetics; and 3) the geometric objects:

ggplot(data = DF,  
 mapping = aes(x=V1, y=V2)  
   ) +  
    geom_TYPE()  
  where:  

- DF = name of data frame  

- V1 = label x variable  

- V2 = label y variable  (if applicable)

- TYPE = type of geom (histogram, density, dot, etc.) 

See Ch. 1 of Wickham et al.

Setup

This code chunk loads and pre-processes the data to select legislative elections in countries that have a long history of democratic elections. This is simply to ensure that country means are based on more than 2-3 data points.

I allow the R code in these chunks to be evaluated and echoed so that you can see the results immediately. If this is too verbose for you, load the markdown file and amend the code chunk’s prefix from {r} to {r, eval=FALSE, echo=FALSE}.

#Load libraries
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
library(janitor)
library(haven)

#Load Data
#Change path as required
Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta") 
elections_data<-Kostelka_Blais_Base
#View(elections_data)

#Filter out presidential elections (EL_TYPE ==1)
elections_data <- filter(elections_data, EL_TYPE==0)

#Compute N records per country, filter out those with <20
elections_data <- elections_data %>%
     group_by(country = as.character(country)) %>%
     mutate(NumRecords = n()) %>%
     ungroup() %>%
     filter(NumRecords >= 20)
elections_data <- elections_data %>%
  mutate(era = case_when(
    year < 1970 ~ 1,
    between(year, 1970, 1990) ~ 2,
    year > 1990 ~ 3
  ))

Histograms in ggplot2

A basic histogram in ggplot2. This basic histogram needs a lot of remedial work.

ggplot(data = elections_data, aes(x=Turnout)) + 
  geom_histogram(binwidth=1)

KDE in ggplot

You can see from the KDE syntax that you can control the kernel and the bandwidth in ggplot2. Refer to the cheat sheet to see what other kernels are permissible. Remember, however, that the bandwidth is the more important element of the KDE.

ggplot(data = elections_data, aes(x=Turnout)) + 
  geom_density(kernel = "gaussian", bw=3)

Quantile plot in ggplot

One important point to note is that ggplot’s geom_qq function generates a quantile plot based on a normal distribution. Specifically, the x-axis represents standard deviations from the mean of a normal distribution, meaning the plot shows the value of the variable at a given number of standard deviations below (or above) the mean. For example, from a standard normal cumulative distribution table, we know that 5% of observations fall at least 1.64 standard deviations below the mean. The reference line that ggplot superimposes on the plot follows the cumulative normal distribution and serves as a benchmark: if the data points closely follow this line, it suggests that the variable is approximately normally distributed.
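As a quick arithmetic check on the 1.64 figure, the inverse normal CDF is available in Python’s standard library (a side calculation, not part of the course code):

```python
from statistics import NormalDist

# The value below which 5% of a standard normal distribution falls
z05 = NormalDist().inv_cdf(0.05)
print(round(z05, 2))   # -1.64
```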

That said, other distributions can also serve as benchmarks. A uniform distribution, for instance, is a common alternative because its cumulative distribution function (CDF) appears as a 45-degree line on a quantile plot. Stata provides flexibility in this regard, allowing users to specify different reference distributions: quantile varname plots ordered values of varname against quantiles of a uniform distribution, while qnorm varname does so against a normal distribution.

It is not immediately clear from ggplot’s documentation whether it allows users to change the reference distribution as easily as Stata does. However, in principle, this functionality could be implemented manually. For a uniform distribution, the mapping is conceptually straightforward: if a variable \(X\) has \(N\) values (i.e., X ranges from 1 to N), then the 25th percentile should correspond to a value of \(N/4\), the 50th percentile, to a value of \(N/2\), the 75th percentile, to a value of \(3N/4\), and so on.

Changing colors

My first quantile plot example shows the use of the color = "" option.

ggplot(data = elections_data, aes(sample=Turnout)) + 
  geom_qq(color = "blue")

Superimposing elements

My second quantile plot shows how to superimpose the 45 degree line on the quantile plot. This is effected by adding a STAT to the ggplot command.

ggplot(data = elections_data, aes(sample=Turnout)) + 
  stat_qq(color="blue") +
  stat_qq_line()

Axis labels and background themes

My third quantile plot shows how to alter x-axis labels and the plot area theme / background.

ggplot(data = elections_data, aes(sample=Turnout)) + 
  stat_qq(color = "blue") +
  stat_qq_line() +
  theme(axis.text.x= element_text(face="bold", colour="red", size=16, angle=45))

  #scale_x_continuous(breaks=seq(-3,3,by=1)) +
  #theme_bw()
Facets

My final quantile plot shows how to use facets to condition on the variable of interest. Comment out the facets option, and see what happens.

#generate three "eras"
elections_data <- elections_data %>%
  mutate(era = case_when(
    year < 1970 ~ 1,
    between(year, 1970, 1990) ~ 2,
    year > 1990 ~ 3
  ))
summary(elections_data$era)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.041   3.000   3.000 
ex6 <-ggplot(data = elections_data, aes(sample=Turnout, color = factor(era))) + 
  facet_grid(~era) +  #comment out this line to see effect on graphic
  stat_qq()+
  theme_classic()
ex6

#ex6 + scale_colour_grey()  #uncomment this to obtain a less flamboyant graphic

Summary

Univariate graphics offer more nuanced insights compared to tables, especially when variables are continuous in nature. Different types of graphics are suited to different data characteristics and research questions, but there are overarching principles of graphical construction: maximize the information-to-ink ratio and minimize occlusion. Be aware that most software packages have “optimal” default settings. That does not mean that you should surrender your judgment as to what graphic is best suited to represent a given variable or data set, or that you should not familiarize yourself with the choices the software is making for you.




Chapter 7: Constructing and Evaluating Indexes

Introduction

In social science research, indices are commonly used to summarize and measure complex concepts using multiple indicators.

Constructing indices requires careful attention to how individual indicators are combined because choices in design can affect the index’s interpretability and utility.

What is an Index?

An index aggregates several indicators into a single measure, typically by summing, averaging, or applying more complex transformations.

Key Definitions:

  • Indicator: A measure or variable representing a specific aspect of a concept (e.g., the proportion of women in secondary education as an indicator of gender equality).

  • Dimension: A grouping of indicators that are conceptually or empirically similar.

Example: The Polity V Regime Index

  • Combines democratic and autocratic indicators.

  • Polity Score = Democracy Score – Autocracy Score

  • The scores for democracy and autocracy are based on five key indicators:

  1. Competitiveness of executive recruitment
  2. Openness of executive recruitment
  3. Regulation of political participation
  4. Competitiveness of political participation
  5. Executive constraints

The aggregation of these indicators creates an annual score for each country, capturing its regime characteristics.

Why Use an Index?

We construct indices because individual indicators may only partially capture the concept of interest. Indices are useful in two broad situations:

  1. The concept is multidimensional: Different indicators capture different dimensions of the concept (e.g., economic and social aspects of gender inequality).

  2. The concept can be scaled: Aggregating indicators helps measure intensity or magnitude (e.g., summing responses to survey questions to gauge political knowledge).

Principles for Constructing Indices

Simplicity vs. Complexity

Simplicity is often preferable when constructing indices. Simple indices:

  • Are easier to interpret and diagnose for errors.

  • Avoid the problem of multiple combinations producing the same score.

Aggregating a few well-defined indicators may provide clearer results than combining dozens of loosely related variables.

The “Many Paths to a Single Score” Problem

As noted above, one critical issue when constructing indices is that different combinations of indicator values can produce the same overall score. For example, consider an index constructed from three dichotomous indicators (each taking values of 0 or 1). An index score of 2 could result from any one of the following combinations:

Multiple Paths to the Same Index Score

Indicator 1   Indicator 2   Indicator 3   Index Score
     1             1             0             2
     1             0             1             2
     0             1             1             2

The question arises: Are these combinations conceptually identical? In many cases, they are not. The absence of one indicator may carry different implications depending on which specific indicator is missing. This issue highlights the importance of carefully considering the interpretation of the index and whether weights or aggregation rules should be adjusted.
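The combinations in the table can be enumerated mechanically. A small Python sketch (illustrative only, not part of the course code):

```python
from itertools import product

# All profiles of three dichotomous indicators whose index score is 2
combos = [c for c in product([0, 1], repeat=3) if sum(c) == 2]
print(combos)   # [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

With k dichotomous indicators, a score of s can arise in “k choose s” distinct ways, so the number of paths to a single score grows quickly as indicators are added.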

The Polity V index provides a real-world instance of this problem:

Polity V: Competitiveness of Political Participation

Category       Contribution to Democracy   Contribution to Autocracy   Contribution to Polity
Competitive                3                           0                          3
Transitional               2                           0                          2
Factional                  1                           0                          1
Restricted                 0                           1                         -1
Suppressed                 0                           2                         -2

Polity V: Regulation of Political Participation

Category            Contribution to Democracy   Contribution to Autocracy   Contribution to Polity
Regulated                       0                           0                          0
Multiple Identity               0                           0                          0
Sectarian                       0                           1                         -1
Restricted                      0                           2                         -2
Unregulated                     0                           0                          0

Observe that the Polity V indicators:

  • Are assumed to be interval-level measures;

  • Have different ranges;

  • Make unequal contributions to the democracy and/or autocracy dimension.

And the aggregation rules:

  • Generate hundreds of possible combinations,

  • many of which result in identical Polity scores;

  • some of which are more frequent than others by construction;

  • yet which also logically proscribe some combinations of democracy and autocracy scores!

These properties mean that the interpretation of the Polity Index is not straightforward!

Aggregating Indicators: Common Problems and Solutions

Whenever we aggregate indicators to form an index, we want those indicators to make roughly equal contributions to the final score of the index. This requires that indicators have comparable ranges and variances. When this requirement is not met, we are likely to end up with an index that distorts rather than accurately summarizes the underlying data.

Unequal Ranges

Imagine that our index is the sum of two indicators:

\[ X = x_1 + x_2 \]

where:

  • \(x_1\): min = 0, max = 100
  • \(x_2\): min = 0, max = 1

Clearly, \(x_2\) makes little contribution to \(X\) relative to \(x_1\).

Rescaling as a possible solution

One solution is to re-scale the indicators so that they have the same range. However, it’s not clear whether we should divide \(x_1\) by 100 or multiply \(x_2\) by 100.

In addition, suppose we chose the latter option, but \(x_2\) was a dichotomous variable taking values of 0 or 1. Index scores below 100 would then be possible only when \(x_2 = 0\), and hence would be due solely to \(x_1\). Furthermore, the distribution of the index is likely to be “lumpy”, with a big jump or discontinuity at the 100-point mark.
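Whichever direction one rescales, the standard tool is the min-max transformation, which maps any indicator onto the [0, 1] interval. A minimal Python sketch with made-up values:

```python
def minmax(xs):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(v - lo) / (hi - lo) for v in xs]

print(minmax([0, 25, 100]))   # [0.0, 0.25, 1.0]
```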

An alternative approach is to change the aggregation rule. For example, if both indicators were continuous, then a multiplicative index,

\[ X = x_1 \times x_2 \]

might work well in the sense that both variables would make roughly equal contributions to the final score of the index, and the index’s distribution would be smooth rather than “lumpy”.

Unequal Variances

Our index remains the sum of two indicators:

\[ X = x_1 + x_2 \]

but now:

\[ E(x_1) = 50, \quad Var(x_1) = 0.25 \] \[ E(x_2) = 0.5, \quad Var(x_2) = 0.25 \]

The issue is no longer that \(x_1\)’s contribution to \(X\) is 100 times \(x_2\)’s, but rather that \(x_1\) is almost constant, with the effect that \(X\) is essentially:

\[ X \approx x_2 + 50 \]

One solution in this situation is to use only \(x_2\).

Standardization as a possible solution

A common reaction to such situations is to standardize both indicators. Standardization converts indicators to have a mean of 0 and a standard deviation of 1. This is done by subtracting the variable’s mean from each observed value, and then dividing by the variable’s standard deviation:

\[ x^* = \frac{x - \bar{x}}{s_x} \]

The implicit assumption in such an approach is that the minor variance in \(x_1\) around 50 is meaningful. Stated more generally, standardizing indicators assumes that differences in variance across indicators are meaningful and should be accounted for.
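A quick Python check (toy numbers, illustration only) that the formula above really does deliver a mean of 0 and a standard deviation of 1:

```python
xs = [0.1, 0.4, 0.9, 0.3]
m = sum(xs) / len(xs)
s = (sum((v - m) ** 2 for v in xs) / (len(xs) - 1)) ** 0.5   # sample SD
z = [(v - m) / s for v in xs]                                # standardized values

zm = sum(z) / len(z)
zs = (sum((v - zm) ** 2 for v in z) / (len(z) - 1)) ** 0.5
print(zm, zs)   # mean is (numerically) 0, SD is 1
```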

We can assess how well standardization works in this scenario by generating some artificial data with 100 observations, followed by standardization.

Summary Statistics of the Artificial Data

Variable    Mean    SD     Min      Max
x           0.47    0.29    0.003    0.971
y          50.47    0.28   50.004   50.998
x*          0       1      -1.602    1.706
y*          0       1      -1.167    1.88



Standardization in such situations (i.e., where raw indicators have extremely different variances) is not inherently wrong. Indeed, as the figure below shows, it has produced an index with a smooth, symmetrical density function. However, standardization does not alter the fact that extreme values on the index \(x^* + y^*\) are mainly due to variance in \(x^*\).

Figure: Index After Standardization

Highly Correlated Indicators

When indicators are highly correlated, aggregating them may provide limited new information and lead to “clumping” at extreme values.

Consider the figure below: if one knows the value of \(x_1\), one can predict the value of \(x_2\) with a high degree of accuracy. That means that \(x_2\) adds almost no information to the index beyond what is already conveyed by \(x_1\). In creating a gender equality index, for example, one may find that female labor force participation and female university graduation rates are highly correlated. As a result, including both indicators in an index may not significantly improve measurement.

Figure: Two Highly Correlated Indicators

As an example, let \(X = x_1 + x_2 + x_3 + x_4 + x_5\), where the \(x\)’s are dichotomous and highly and positively correlated. In other words, cases that have \(x_1 = 1\) tend to have \(x_2 = 1\) and \(x_3 = 1\), and so on. The figure below shows the probability density (or mass) function (PDF) of the resulting index. The strong positive correlations between indicators result in most cases being clustered at 0 or 5.

Figure: The PDF of an Index of Highly Correlated Indicators
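The clustering at the extremes is easy to reproduce by simulation. In this hedged Python sketch (simulated data, not the code behind the figure), five dichotomous indicators share one latent trait, so they usually agree:

```python
import numpy as np

rng = np.random.default_rng(42)
latent = rng.normal(size=2000)                        # one underlying trait
noise = rng.normal(scale=0.3, size=(2000, 5))
x = (latent[:, None] + noise > 0).astype(int)         # 5 correlated 0/1 indicators
index = x.sum(axis=1)                                 # the additive index

counts = np.bincount(index, minlength=6)              # frequency of scores 0..5
print(counts)
```

Because the indicators move together, intermediate scores are rare: the index behaves almost like a single dichotomous variable.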

Principal component analysis (PCA) as a possible solution

If all indicators are very highly correlated (e.g., in the neighborhood of \(r = .9\) or above), drop some of the indicators. At the limit, when two or more indicators are perfectly correlated, use a single indicator in place of the index: it’s more straightforward to code, measure, and interpret. In choosing which indicator to rely on in place of the index, choose the one with the greatest variance. An alternative when the collinearity between indicators is less severe is to use principal component analysis (PCA) to reduce redundancy.
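For intuition about what PCA buys you, here is a hedged Python sketch (simulated data, illustration only) that extracts the first principal component of three noisy copies of the same signal via SVD; the share of variance on that component measures the redundancy:

```python
import numpy as np

rng = np.random.default_rng(7)
base = rng.normal(size=300)                            # shared signal
X = np.column_stack([base + rng.normal(scale=0.2, size=300)
                     for _ in range(3)])               # 3 noisy copies

Xc = X - X.mean(axis=0)                                # center each indicator
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                                       # first principal component
share = S[0] ** 2 / (S ** 2).sum()                     # variance share of PC1
print(round(float(share), 2))
```

When the first component captures nearly all the variance, the three indicators are largely interchangeable, and one component (or one indicator) can stand in for the set.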

Scaling and Unidimensional Scales

Unlike multidimensional indices, scales are designed to measure a single underlying concept. A scale typically consists of items that follow a progression in difficulty or specificity.

Example: A Political Knowledge Scale

  1. What party currently forms the government?

  2. Does the government control a majority of parliamentary seats?

  3. Has the government recently voted to provide military support to Ukraine?

Guttman Scaling

  • A unidimensional scaling method based on cumulative patterns of responses.

  • Items are ordered such that agreement or success on more difficult items implies agreement or success on easier items.

Agreement Matrix Example: Rearranging Policies for Agreement

Consider scaling legislators based on their agreement with specific policy votes. Suppose we have the following initial binary agreement matrix, where 1 indicates agreement and 0 indicates disagreement:

Legislator-Policy Agreement Table
Legislator P1 P2 P3 P4 P5
L1 1 1 0 1 0
L2 1 1 1 1 1
L3 0 0 1 0 0

By conceptually rearranging the columns, we aim to group policies such that agreement patterns among legislators are clearer. Rearranging the matrix based on policy similarity, we get:

Legislator-Policy Agreement Table (Reordered Columns)
Legislator P1 P2 P4 P5 P3
L1 1 1 1 0 0
L2 1 1 1 1 1
L3 0 0 0 0 1

This reordering highlights the similarity between L1 and L2, who agree on most policies, and the divergence of L3, who agrees only on P3.
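The reordering above is just a column permutation. A minimal Python sketch mirroring the two tables:

```python
import numpy as np

# Rows are L1-L3, columns P1-P5, as in the agreement table above
M = np.array([[1, 1, 0, 1, 0],
              [1, 1, 1, 1, 1],
              [0, 0, 1, 0, 0]])

order = [0, 1, 3, 4, 2]          # reorder columns to P1, P2, P4, P5, P3
print(M[:, order])               # matches the reordered table
```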

We can represent this agreement spatially by placing legislators on a line according to their similarity:



[L3] ———————— [L1 or L2] – [L1 or L2]

Given that L1 and L3 disagree on most policies but respond identically on P5, and that L2 sides with L3 on P3, there is some indeterminacy in the exact relative positions of L1 and L2 with respect to L3. This underscores a key limitation: when the data contain mixed agreement patterns, it may not be possible to generate a unique and unambiguous ordering of the legislators.

The key insight is that the spatial locations are relative, not absolute. The same configuration could be expanded or contracted along the line without changing the meaning of the representation. This means that we can only interpret relative positions, not precise distances.

A variety of statistical procedures, such as multidimensional scaling or correspondence analysis, can perform these transformations mathematically. However, the conceptual effort of identifying agreement and similarity patterns forms the foundation of these techniques.

Practical Considerations

Keep constituent indicators in the data set

If you are building your own index, keep the indicators that make up the index in the data set. Also ensure that the syntax necessary to construct the index from those indicators accompanies the data set as part of the documentation. This allows you and others to both replicate and validate (see below) your index.

Learn what indicators make up a third-party index

It is not always possible to obtain the indicators and algorithm used to construct a third-party index (e.g., the Fragile States Index), but you should read any documentation to understand what indicators are used to build the index. This is important to avoid situations where you unwittingly employ an indicator from the index as a dependent variable in a model that uses the index as an independent variable, or vice versa. Such a model would show a strong correlation between the two variables, but it would be entirely spurious!

Missing Data

You should always examine the missingness patterns of the indicators that comprise an index. Consider an extreme but illustrative example: our index is \(X = x_1 + x_2\), but the data are such that for cases where we observe \(x_1\) we rarely observe \(x_2\), and vice versa. If we employ listwise deletion, we are forced to rely on the small number of cases for which both \(x_1\) and \(x_2\) are observed. Things are worse if we instead sum whatever indicators happen to be observed case by case: \(X\) would be reduced to \(x_1\) in some cases and \(x_2\) in others, and for no case would \(X = x_1 + x_2\)! Furthermore, standard diagnostics (e.g., graphical examination of \(X\)’s statistical distribution) would not necessarily catch this problem.
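A toy Python illustration (made-up values) of how the two strategies diverge:

```python
rows = [(1.0, None), (None, 2.0), (3.0, 4.0)]   # (x1, x2) for three cases

# Listwise deletion: keep only cases with both indicators observed
listwise = [a + b for a, b in rows if a is not None and b is not None]

# Summing whatever is observed silently changes the index's meaning per case
available = [sum(v for v in r if v is not None) for r in rows]

print(listwise)    # [7.0]
print(available)   # [1.0, 2.0, 7.0]
```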

Aggregation Rules

The choice of aggregation method (sum, average, product) should align with the theoretical framework. For example, multiplicative aggregation is appropriate when theory suggests that indicators interact in complex ways, with one indicator hypothesized to amplify or dilute the impact of another, for example. By contrast, additive indicators make sense when theory suggests that the indicators’ effects accumulate.

Validating Indices

  • Conduct sensitivity analysis to test how robust the index is to different assumptions. For example, if an index is composed of 3 indicators, a, b and c, you can create alternative formulations of the index from subsets of the indicators, i.e., {a, b}, {a, c}, and {b, c}. Comparing the properties of these alternative formulations to one another and to the full index can alert us as to whether one indicator makes a disproportionate or idiosyncratic contribution to the index. The techniques shown in Chapters 6 and 7 readily lend themselves to such an application.

  • Assess construct validity by comparing the index with external measures or known benchmarks. This comparison might be qualitative and approximate in nature. For example, we could rank countries on the basis of our index and consider whether the ranking makes intuitive sense. However, there are more rigorous and reliable ways to benchmark an index.

- If both the index and the benchmark are continuous, for example, one can plot one against the other in a scatterplot. A lowess regression line (Chapter 8) can then be used to assess whether the index varies monotonically with the benchmark.

- If the index is discrete, then one can regress the benchmark on the index, with the index coded as a series of dummy variables, one for each point on the index. If the coefficients on the index dummies are approximately equal, we can infer that the index behaves as if it were an interval measure. This, in turn, provides some assurance that the indicators each relate to the concept underpinning the index in a similar and stable manner.
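The leave-one-indicator-out sensitivity check described above can be sketched in a few lines of Python (toy numbers; indicator c is deliberately idiosyncratic, and all names here are hypothetical):

```python
def pearson(u, v):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = sum((x - mu) ** 2 for x in u) ** 0.5
    sv = sum((y - mv) ** 2 for y in v) ** 0.5
    return cov / (su * sv)

a = [1, 2, 3, 4, 5]
b = [2, 2, 3, 5, 5]
c = [5, 1, 4, 1, 3]                      # idiosyncratic indicator (toy data)
full = [x + y + z for x, y, z in zip(a, b, c)]

subsets = {"a+b": [x + y for x, y in zip(a, b)],
           "a+c": [x + z for x, z in zip(a, c)],
           "b+c": [y + z for y, z in zip(b, c)]}
results = {name: pearson(full, idx) for name, idx in subsets.items()}
print(results)
```

A subset whose correlation with the full index drops sharply points to the excluded indicator as disproportionately influential.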

Exercises

Exercise 1: Constructing a Simple Index
Create a dataset containing the following variables for 10 countries:

- X1: Female labor force participation rate (%)
- X2: Percentage of women in secondary education
- X3: Percentage of parliamentary seats held by women

Tasks:

1. Rescale the indicators to have a range from 0 to 1.
2. Construct a simple index by averaging the three indicators.
3. Interpret the results and identify which countries score the highest and lowest.

Exercise 2: Using Standardization
Use the same dataset as in Exercise 1.

Tasks:

1. Standardize each indicator to have a mean of 0 and standard deviation of 1.
2. Construct an index using the standardized values.
3. Compare the results with the simple index and discuss any differences.

Exercise 3: Dealing with Correlated Indicators
Introduce a new indicator, X4, which is highly correlated with X1.

Summary

This chapter provides a comprehensive guide to constructing and scaling indices, highlighting best practices, potential pitfalls, and practical applications in social science research.

Appendix: Generating Artificial Data in R and Stata

Here’s the code that I used to generate the artificial data for some of the examples above.

# Here is R code to replicate my simulation.
# Load necessary library
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Generate artificial data
N <- 100
x <- runif(N, min = 0.003, max = 0.971)  # Uniform distribution
y <- runif(N, min = 50.004, max = 50.998)

# Standardize the data
x_standardized <- (x - mean(x)) / sd(x)
y_standardized <- (y - mean(y)) / sd(y)

# Create summary statistics
summary_data <- data.frame(
  Variable = c("x", "y", "x*", "y*"),
  Mean = c(mean(x), mean(y), mean(x_standardized), mean(y_standardized)),
  SD = c(sd(x), sd(y), sd(x_standardized), sd(y_standardized)),
  Min = c(min(x), min(y), min(x_standardized), min(y_standardized)),
  Max = c(max(x), max(y), max(x_standardized), max(y_standardized))
)

# Print the summary table using kable
library(knitr)
library(kableExtra)

kable(summary_data, caption = "Summary Statistics of the Artificial Data") %>%
  kable_styling(full_width = FALSE, position = "center")

Stata code to generate the same artificial data is:

* Set seed for reproducibility
set seed 123

* Generate artificial data
set obs 100
generate x = runiform(0.003, 0.971)
generate y = runiform(50.004, 50.998)

* Standardize the data
* (summarize must run before generate so that r(mean) and r(sd) refer to the right variable)
summarize x
generate x_standardized = (x - r(mean)) / r(sd)

summarize y
generate y_standardized = (y - r(mean)) / r(sd)

* Create summary statistics for all variables
summarize x y x_standardized y_standardized




Chapter 8: Bivariate Graphics

Overview

In this chapter, we explore graphing bivariate relationships using three fundamental and flexible frameworks:

  1. Time series trend plots

  2. Scatterplots

  3. Quantile-Quantile Plots

We will examine methods to scale and smooth data within these frameworks, focusing on the role of moving averages and lowess (or loess) regression.

Time Series trend (Trendline) Plots

Time series, time series trend, or trend-line plots are essential to the visualization of temporal data. (I will refer to them as “time series” plots.) A time series plot shows whether a quantity increases or decreases over time. As importantly, time series plots reveal changes in volatility, the sequence of cause and effect, and the impact of interventions on the time series–insights that are less obvious but crucial for analysis.

However, time series plots must be carefully constructed because they easily become cluttered as lines crisscross. Reference and grid lines add to the clutter. My advice is to limit each plot to three time series at most; sometimes, even that is too many. Take as an example the time series plot below. The plot is derived from Besley and Burgess’s (2000) data on flooding in Indian states. It depicts the annual flood-affected area for each of 4 states from 1950 to 1995. Following Kastellec’s (2025) advice:

  • I use color to distinguish the four time series;

  • I attach similarly colored labels to the series of each state to avoid using a legend that takes up space and forces the reader’s eyes to saccade from data to legend;

  • I ensure that axes are labelled–and that those labels are large enough to be easily legible.

For all that, the plot remains a colorful mess–and it only plots 4 time series; there are 12 other states in the data set.

A better option in this case, and in many others, is to employ facets (also called “panels” or “small multiples”) that show the same plot for each state in miniature.


  • My improved time series plot employs a minimalist gray-scale design that avoids visual distraction. Unlike Schwabish (2021) and Kastellec (2025), who advocate for the use of color, I follow Cleveland’s advice to use color sparingly; in this re-organized time series plot, it is unnecessary.

  • The deliberate use of identical x-axis and y-axis scales–notwithstanding the varying magnitudes of flooding across states–is important. This uniformity ensures that the reader only needs to decode the plot’s structure and meaning once.

  • The ordering of the data is also critical: I arrange the states from least (top-left) to most (bottom-right) flooding. This avoids the arbitrariness of alphabetical ordering. (This was actually a chore to do in Stata as my code at the end of the chapter shows.)

  • I duplicate the y-axis on the right to provide the reader with a visual anchor, given that the graph spans four facets. Rather than replicating y- and x-axis labels for every facet–which would introduce redundant clutter–the second y-axis offers a cleaner and more effective way to facilitate horizontal comparisons. With fewer facets (e.g., two wide), this second axis would be unnecessary. I also adjusted the aspect ratio to slightly compress the width, further enhancing horizontal comparisons.

Moving Averages to Smooth Time Series Data

Highly variable time series data can obscure trends, making it difficult to draw meaningful conclusions. For example, we might wonder whether voter turnout has declined in U.S. elections over time. By applying smoothing techniques, we can address this issue and uncover any underlying patterns that may be hidden by erratic fluctuations.

Implementing a Moving Average

Implementing a moving average involves five conceptual steps:

  1. Order the data: Arrange observations \(X_t\) sequentially by time, from \(t=1\) to \(t=T\).

  2. Define the window: Choose a window or span of \(k\) time periods, where \(k \geq 2\) and \(k \ll T\).

  3. Compute the moving average: For the first window, compute the mean of \(k\) observations from \(X_{t=1}\) to \(X_{t=k}\).

  4. Shift the window: Move the window forward by one period.

  5. Repeat: Continue computing the mean for subsequent windows.
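The five steps above can be sketched in a few lines of code. The snippet below is in Python for concreteness (the same logic carries over directly to R, Stata, or a spreadsheet), and the turnout figures are invented for illustration.

```python
def moving_average(x, k):
    """Trailing moving average over a window of k periods.

    The data x are assumed to be ordered by time (Step 1); the first
    k - 1 entries of the result are None because the window is
    incomplete there.
    """
    out = [None] * (k - 1)
    for t in range(k - 1, len(x)):
        window = x[t - k + 1 : t + 1]   # the k most recent observations
        out.append(sum(window) / k)     # Step 3: mean within the window
    return out                          # Steps 4-5: the loop slides the window

# Invented turnout series, smoothed with a 3-period window
turnout = [60, 58, 65, 55, 62, 59, 61]
print(moving_average(turnout, 3))
```

Note that each smoothed value averages away period-to-period noise at the cost of losing the first \(k-1\) observations.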

Practical Example: Programming a Moving Average in Sheets

While many software programs (including R, Stata, and many spreadsheets) automate this process, it is instructive to compute a moving average step-by-step. Doing so helps you understand what your software is doing when you instruct it to smooth a time series. We will do this in Sheets because it is particularly transparent.

Implementing Moving Averages in Sheets
    1. Set up the data set:
        ◦ Cell A1 -> “Year” 
        ◦ Cell A2 -> 1980 
        ◦ Cell A3 -> = A2+1 (Fill down for 20-30 rows) 
    2. Generate random data:
        ◦ Cell B1 -> “Y” 
        ◦ Cell B2 -> = (A2 * 0.1) + 15 * RAND() 
    3. Fix random values: Copy > Paste Special > Values Only to paste the values into Column C to prevent random number changes.
    4. Calculate the moving average:
        ◦ In D1, label the column as “MA(Y)”. 
        ◦ In D5, use the formula: =AVERAGE(C2:C5) and fill down. 
    5. Create a trend-line chart:
        ◦ Highlight the data and select the chart icon. 
        ◦ Compare the raw data (Column C) with the moving average (Column D).

Scaling and Transformation in Multivariate Plots

Several challenges can arise when we try to visualize the relationship between two variables. These challenges include:

  • Different magnitudes across variables: When the variables differ in scale or magnitude, one variable may visually dominate the plot, masking important patterns in the other.

  • Complex or non-linear relationships: If the relationship between variables is non-linear or involves interactions, simple scatterplots may fail to convey the underlying dynamics. The difficulty is that nature does not typically produce linearity; humans, nonetheless, find it much easier to perceive linear relationships in complex data. Therefore, if we can linearize a bivariate relationship in a graphical setting, it is generally desirable to do so.

  • Clustered observations due to skewness: Data skewness can cause observations to cluster, leading to occluded points and a loss of information.

Logarithmic Transformations

Addressing these issues often requires transforming the variables to improve interpretability and reveal hidden patterns. Logarithmic transformations are particularly good for these purposes because they convert multiplicative relationships into additive ones. This often has the effect of linearizing bivariate relationships, making it easier to interpret data across varying scales. Logarithmic transformations are also useful when the range of one variable is several orders of magnitude larger than the range of other variables.

Recall from your high school mathematics that a logarithm is the exponent to which the base must be raised to give the desired number. A logarithm is thus the answer to the question: “To what exponent must a given base be raised to produce this specific number?” Mathematically, if \(b^x = y\), then \(\log_b(y) = x\), where:

  • \(b\) is the base of the logarithm (e.g., 10, 2, or \(e\)),
  • \(y\) is the result of the exponentiation,
  • \(x\) is the exponent to which the base must be raised to get \(y\).

For example, \(\log_{10}(100) = 2\) because \(10^2 = 100\).

Common bases

In theory, we can use whatever base we please. However, three bases are particularly useful:

  • log10: Useful for interpreting 10-fold changes. For example, a 1-unit increase in log10 corresponds to a tenfold increase in the original variable. Base 10 partners well with metric measurements.

  • Natural log (ln): Common in calculus-based models and regression models because the derivative of \(\ln(x)\) is \(\frac{1}{x}\).

  • log2: Useful for understanding doubling effects. A 1-unit increase in log2 implies that the original variable doubles. This is very intuitive.

Converting Logarithm Bases

If you want to convert between logarithm bases, you can use the logarithm change of base rule:

\[ \log_a x = \frac{\log_b x}{\log_b a} \]

For example, if you need \(\log_2 8\):

\[ \log_2 8 = \frac{\log_{10} 8}{\log_{10} 2} \]
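The change-of-base rule is easy to verify numerically. Here is a minimal Python check (any language with a log function works the same way):

```python
import math

# log_2(8) via the change-of-base rule, using base-10 logarithms
log2_8 = math.log10(8) / math.log10(2)   # 2^3 = 8, so this equals 3

# The rule holds for any intermediate base, e.g. natural logarithms
assert abs(log2_8 - math.log(8) / math.log(2)) < 1e-12

# Python's math.log(x, base) applies the same rule internally
assert abs(math.log(8, 2) - 3.0) < 1e-12

print(log2_8)
```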

Odds Ratios and Logits

When dealing with proportions or percentages, re-expressing the data as odds ratios or log-odds can help linearize relationships and simplify analysis.

  • Odds ratio: \(\text{Odds} = \frac{p}{1-p}\) where \(p\) is the proportion.

  • Logits: Taking the logarithm of the odds ratio produces a logit, which is often used in regression models involving binary or proportional outcomes.

    For example, if turnout is 80 percent (i.e., \(p = .8\)), the odds ratio would be \(\frac{0.8}{0.2} = 4\), and the logit would be \(\log(4)\).
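The turnout example translates directly into code. This quick Python sketch uses natural logs, as is conventional for logits; substitute a base-10 or base-2 log if that aids interpretation.

```python
import math

def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1 - p)

def logit(p):
    """Log-odds (natural log) of a proportion p."""
    return math.log(odds(p))

p = 0.8           # turnout of 80 percent
print(odds(p))    # odds of roughly 4 to 1
print(logit(p))   # log(4), about 1.386
```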

Practical Example:

The series of scatterplots in the graphic below uses the Kostelka & Blais (2021) data set on electoral turnout. As before, I have limited the scope of the data to legislative elections in countries with 20 or more elections included in the data set. The purpose of this graphic is solely to highlight the rationale for re-scaling variables and to demonstrate some methods to do this. The Stata and R code to produce these graphics is in the appendix below.

  • In Panel A, turnout is shown as a percentage of registered voters. This not only reflects how we tend to define and conceive of turnout–as the fraction of eligible voters who vote–but also standardizes turnout across electorates of different sizes to facilitate comparison. Electorate size, by contrast, varies widely across countries and is highly right-skewed, with many observations clustered and obscured in the plot’s top-left corner.

  • Panel B applies a log10 transformation to the electorate size, which eases the skew and provides a clearer view of the data distribution. This transformation spreads out the smaller electorate sizes and improves visibility.

  • Panel C uses an odds ratio to transform turnout. While this reduces the clustering seen in Panel A, it introduces extreme outliers on turnout that distort the visualization. The trade-off here is between reducing occlusion and managing the influence of outliers on interpretation.

  • Panel D employs secondary axes to translate logged quantities back to more intuitive values. However, the axes have the effect of “crowding” the graph region. The data cloud is also no more linear than in Panel B. The attempt to balance intuitive interpretation with data clarity highlights the challenges of using secondary axes effectively.

My last effort to visualize these data combines the best aspects of Panels B and D. It retains the log10 transformation of electorate size for clarity while using only a secondary x-axis to avoid too much clutter. The result is a scatterplot that provides a linear perspective on the data without excessive complexity.

Logarithms with Different Bases in R and Stata

In R, use log(x, base) to compute logarithms with any base. If base is omitted, e is assumed (log(x) = ln(x)). R also provides shortcuts for common bases: log2(x) = log(x, 2) and log10(x) = log(x, 10).

In Stata, log(x) and ln(x) both compute the natural logarithm (ln(x)), while log10(x) computes base 10 logarithms. To compute logarithms of other bases, you must apply the change of base rule (see above). For example, to compute \(\log_2 x\) in Stata, type:

gen y = log(x) / log(2)

Lowess (or Loess) Smoothing

A key distinction in data modeling is between parametric and nonparametric estimation:

  • Parametric estimation assumes a specific functional form for the relationship between variables. For example, linear regression assumes that the relationship is linear and that the residuals follow a normal distribution.

    The advantage of parametric methods lies in their simplicity and interpretability: when the assumptions hold, they provide precise and efficient estimates. However, these assumptions can be restrictive and, when violated, can produce misleading results.

  • Nonparametric estimation makes no such assumptions about the functional form or distribution, allowing for more flexibility in capturing complex relationships. Nonparametric methods are more adaptable and can reveal patterns that parametric models might miss, but they run the risk of over-fitting if not applied carefully.

Lowess (locally weighted regression) is a nonparametric technique that fits smooth curves to multivariate data by allowing the data to “speak for themselves.” It is particularly useful when:

  • Theoretical guidance is limited: If there is no prior knowledge of the functional relationship between the variables, lowess can provide an exploratory look.

  • The data are complex: The presence of noise, non-linearity, or outliers makes a flexible approach desirable.

However, the flexibility of lowess comes with a trade-off: while it is adept at extracting an interpretable signal from noisy data, it is also prone to over-fitting. Over-fitting occurs when the lowess curve closely follows random fluctuations in the data, capturing noise rather than the underlying trend.

Steps for Lowess Smoothing

This series of graphics (based on Cleveland 1993, 95) illustrates the conceptual process of lowess smoothing. Understanding these steps will help you appreciate what the software’s lowess algorithm is doing when it automatically fits the curve to your data.

Step 1: Fix the window size, \(\alpha\)
  • Order the data: Sort observations \((x_i,y_i)\) by the independent variable, i.e. from smallest to largest \(x_i\).

  • Define intervals: Divide the range of \(x\) into intervals (windows) centered on evaluation points \(v_j\).

  • A “tension” parameter, \(\alpha\), controls the width of these windows. When \(\alpha = 1\), the window’s width is set such that it contains 100 percent of the data; when \(\alpha = .5\) the window’s width is set such that it contains 50 percent of the data, and so on.

Step 2: Weight the data
  • Weight the data points in the window using a kernel, just as in KDE. A variety of kernels are available–but the width of the window tends to be the more important parameter.

Step 3: Estimate local regression
  • Estimate local regression: Regress \(y\) on \(x\) within each window. A control parameter, \(\lambda\), controls the degree of polynomial on \(x\) in these regressions.

  • Predict values: Use the regression results to predict \(y\) values at evaluation points. (The illustrated best-fit line may strike you as too flat given the data-–but recall that the data are weighted.)

Step 4: Slide the window one observation and repeat
  • The window is then moved to the right by one observation, and the next local regression is estimated. This is repeated until the window reaches the right edge of the data.

  • Connect the points: Join the predicted values to form a smooth curve.
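The four steps can be sketched directly in code. The function below is a deliberately simplified toy written in Python rather than R's or Stata's built-in lowess commands: it assumes tricube weights, \(\lambda = 1\) (local linear fits), and windows defined by nearest neighbors. It is meant to show the mechanics, not to replace the built-in routines.

```python
import numpy as np

def simple_lowess(x, y, alpha=0.5):
    """Conceptual lowess: local linear fits (lambda = 1) with tricube weights.

    x is assumed sorted; alpha is the share of the data in each window.
    Returns fitted values at each x[i].
    """
    n = len(x)
    k = max(2, int(np.ceil(alpha * n)))   # Step 1: window holds share alpha of data
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]        # the k observations nearest x[i]
        w = (1 - (dist[idx] / dist[idx].max()) ** 3) ** 3   # Step 2: tricube weights
        sw = np.sqrt(w)
        # Step 3: weighted least squares for y = b0 + b1 * x within the window
        X = np.column_stack([np.ones(k), x[idx]])
        b, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = b[0] + b[1] * x[i]    # predict at the evaluation point
    return fitted                          # Step 4: the loop slides the window

# Sanity check on noiseless linear data: the fit should recover the line
x = np.linspace(0, 1, 50)
smooth = simple_lowess(x, 2 * x + 1, alpha=0.5)
```

Joining the `fitted` values with line segments produces the smooth lowess curve described in the text.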

Setting Control Parameters

  • \(\alpha\) (window width): Controls the bias-variance trade-off by determining the width of the window. A small \(\alpha\) risks over-fitting by capturing noise, while a large \(\alpha\) may smooth the data so much that it obscures important features. We have encountered this sort of trade-off before, with both histograms and kernel density estimates.

  • \(\lambda\) (degree of polynomial): Determines the functional form of the local regression. For example:

    • When \(\lambda = 1\), the local regression fits a linear model of the form \(y = \beta_0 + \beta_1x\), which assumes a straight-line relationship between \(x\) and \(y\).
    • When \(\lambda = 2\), the local regression fits a quadratic model of the form \(y = \beta_0 + \beta_1x + \beta_2x^2\), which allows for curvature, capturing relationships where \(y\) may increase and then decrease or vice versa.

Practical Implementation
  1. Set \(\lambda\) by inspection:
  • If the data cloud is monotonic, that is, it rises or falls uniformly across the range of \(x\), use \(\lambda = 1\).

  • If the data cloud is non-monotonic, that is, it rises and then falls (or vice versa) across the range of \(x\), then use \(\lambda = 2\).

  • You should almost never need \(\lambda > 2\).

  2. Determine \(\alpha\) iteratively: Start with a small value and gradually increase it until the lowess curve is appropriately smooth. If the lowess curve is almost a straight line, reduce \(\alpha\) a little–because at that point lowess provides little more insight than a linear regression of \(y\) on \(x\).

Evaluating the Lowess Model
  • Residual analysis: Residuals are the differences between the observed values and the values predicted by the lowess curve. A well-fitted lowess curve should capture all systematic variation in the data, leaving only random noise in the residuals. Plot the residuals against the independent variable: if they are distributed randomly around zero, with no discernible structure, the model has captured the underlying relationship. If instead the residual plot shows a systematic pattern, such as a trend or clustering, the lowess curve has not fully captured the relationship between the variables, and you may need to adjust the bandwidth \(\alpha\) or the degree of the polynomial \(\lambda\).

Practical Example: Fitting and Diagnosing Loess

This example uses geom_smooth and stat_smooth in R’s ggplot2 to fit a loess curve to the Kostelka and Blais data, and then to diagnose the fit. Both geom_smooth and stat_smooth provide options to set bandwidth and polynomial degree. I provide parallel Stata code at the end of the chapter.

Step 1: Choose a bandwidth and estimate the loess function

I begin by setting the bandwidth of the loess curve to .4 (`span = .4`).

# A tibble: 6 × 46
  country        year EL_TYPE       PLT_dem0_time Vdem_v2elsrgel Vdem_v2ellocelc
  <dbl+lbl>     <dbl> <dbl+lbl>             <dbl> <dbl+lbl>      <dbl+lbl>      
1 1 [Albania]    2005 0 [Legislati…            16 5 [Executive … 5 [Executive a…
2 1 [Albania]    2009 0 [Legislati…            20 5 [Executive … 5 [Executive a…
3 1 [Albania]    2013 0 [Legislati…            24 5 [Executive … 5 [Executive a…
4 1 [Albania]    2017 0 [Legislati…            28 5 [Executive … 5 [Executive a…
5 2 [Argentina]  1973 0 [Legislati…             1 5 [Executive … 5 [Executive a…
6 2 [Argentina]  1973 1 [President…             1 5 [Executive … 5 [Executive a…
# ℹ 40 more variables: Vdem_v2elparlel <dbl+lbl>, Vdem_v2ex_elechos <dbl>,
#   Vdem_v2lgbicam <dbl+lbl>, Vdem_v2ddyrpl_1 <dbl>, Vdem_v2ddyrpl_2 <dbl>,
#   Vdem_v2ddyrpl_3 <dbl>, Vdem_v2ddyrpl_4 <dbl>, Vdem_v2ddyrrf_1 <dbl>,
#   Vdem_v2ddyrrf_2 <dbl>, Vdem_v2ddyrrf_3 <dbl>, Vdem_v2ddyrrf_4 <dbl>,
#   Vdem_v2ddyror_1 <dbl>, Vdem_v2ddyror_2 <dbl>, Vdem_v2ddyror_3 <dbl>,
#   Vdem_v2ddyror_4 <dbl>, Vdem_v2ddyrci_1 <dbl>, Vdem_v2ddyrci_2 <dbl>,
#   Vdem_v2ddyrci_3 <dbl>, Vdem_v2ddyrci_4 <dbl>, WB_v_513 <dbl>, …

Step 2: Save and diagnose residuals

The next step is to diagnose the data using R’s loess function to generate predicted values and residuals. I then fit a second loess curve through the scatterplot of the residuals graphed against logged electorate size.

Recall that, in theory, the loess curve that we plot through that residual data cloud should be flat, indicating that the residuals are uncorrelated with logged electorate size. If that is not the case, it would suggest that our initial bandwidth was too large to fully capture the relationship between turnout and logged electorate size.

The residual loess curve fluctuates within narrow bounds around 0. The curve does not systematically rise or fall. This evidence suggests that the residuals are essentially uncorrelated with logged electorate size and, in turn, that the original choice of a bandwidth of .4 was sufficient to capture the relationship between turnout and logged electorate size. One could repeat this exercise using a slightly larger bandwidth in the original plot–but this evidence indicates that a bandwidth of .4 is defensible.

Quantile-Quantile Plots

A quantile-quantile (Q–Q) plot compares quantiles of one variable (on one axis) with those of another (on the other axis). Points close to the reference line suggest similar shapes and scales for the two distributions, whereas deviations indicate differences in distribution or scale.

One might think overlaid kernel density (KDE) plots of the two distributions would serve much the same purpose. That is not quite so. First, a KDE plot shows the probability density function (PDF) of a variable, whereas a quantile plot reflects the cumulative distribution function (CDF). The two functions are mathematically related, of course (the PDF is the derivative of the CDF), but for all but the simplest distributions it is difficult for humans to conceive of a distribution’s CDF solely on the basis of a visual depiction of its PDF, and vice versa. Thus, we should see Q–Q plots as complementary to overlaid KDEs. Second, the KDE purposely interpolates between gaps in the data to generate a smooth curve. Quantile plots do not “polish” away gaps in the data in this way–and hence are more directly informative about the frailties of the underlying data. The same can be said for points in the distribution where the data bunch up. For both reasons, we may want to use a Q–Q plot to compare the distributions of two variables.

Q–Q Plots for Index Evaluation
A useful variation on the standard quantile–quantile plot is to graph the quantiles of one variable (e.g., an index or specific indicator) against the quantiles of another (e.g., a benchmark index or dimension). This approach can help reveal:

  • Relationships among indicators or dimensions of an index
  • Alignment between an index and a benchmark
  • Potential shortcomings in how the index tracks an underlying concept

Ideally, an index should increase smoothly and monotonically as the latent concept it measures increases. If the Q–Q plot shows “flat spots”, however, it would suggest that one measure is rising smoothly while the other “stutters”, indicating a mismatch in how the two variables capture the underlying phenomenon. This issue is common when comparing a continuous measure to a discrete one.

Here is a political science example. The Q-Q plot below compares the Polity index’s democracy dimension with V-Dem’s electoral democracy (polyarchy) index. I multiplied the latter by 10 to align both measures on the same scale, although this step is not necessary.

Conceptually, a Q-Q plot aligns quantiles of one variable with those of another. If the two distributions have the same shape and scale, the data points should lie on the 45-degree line, indicating that a one-percentile rise in one distribution matches a one-percentile rise in the other. We do not see that here. Instead, small increments in V-Dem’s electoral democracy index correspond to large jumps in Polity’s democracy dimension, indicating that Polity differentiates more sharply among regimes that V-Dem considers relatively homogeneous. At higher scores, the pattern reverses: Polity draws minimal distinctions among regimes that V-Dem treats as highly diverse.

This Q-Q plot also highlights the Polity scale’s discrete basis, which clusters observations at each point on the scale. By contrast, V-Dem’s measure is more continuous. This plot does not indicate that one measure is valid and the other invalid; it only shows that their distributions differ, reflecting distinct conceptual and operational assumptions. V-Dem’s electoral democracy index provides finer distinctions among regimes that Polity considers largely homogeneous, and this is especially evident at the higher end (8+) of the Polity democracy scale.

For more detail on Q–Q plots and their applications, see Wikipedia: Q–Q Plot
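The mechanics of a Q–Q plot reduce to pairing equal quantiles of two variables. The Python sketch below uses invented data: a continuous 0-10 measure and a coarsened, discrete version of it, loosely mimicking the V-Dem/Polity contrast. All values here are synthetic, generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: a continuous 0-10 measure and a discrete version of it
continuous = rng.beta(2, 2, size=500) * 10
discrete = np.clip(np.round(continuous + rng.normal(0, 0.5, size=500)), 0, 10)

# A Q-Q plot pairs the same quantiles of the two distributions
qs = np.linspace(0.05, 0.95, 19)
qq_x = np.quantile(continuous, qs)
qq_y = np.quantile(discrete, qs)

# Plotting qq_y against qq_x (with a 45-degree reference line) shows where
# the distributions diverge; the discrete measure tends to produce "flat
# spots" where successive quantiles repeat the same value.
for a, b in zip(qq_x, qq_y):
    print(f"{a:5.2f}  {b:5.2f}")
```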

Exercise: Transforming plotted variables
- Identify a data set and select two variables. 
- Create a scatterplot and highlight any distortion or clustering. 
- Transform one or both variables and produce a second scatter plot. 
- Discuss the effectiveness of the transformation.

Exercise: Lowess Smoothing
- Generate a scatterplot with a lowess curve. 
- Evaluate the propriety of the bandwidth using residual plots. 
- Provide a brief assessment of the lowess model and submit relevant code and data. 

Summary

This chapter highlights the importance of effective data visualization and transformation in revealing patterns and trends in bivariate relationships. By understanding how to scale, smooth, and transform data, you will be better equipped to interpret complex data sets and derive meaningful insights.

Appendix: Stata and R Code

This Stata code reproduces the “better” time series plot. The nuance in the coding lies in the way that I had to reorder and re-label states to coax Stata to order the facets by flood affected area.


// A faceted trendline charts to avoid occlusion and clutter!
  
use "Calamity.dta", clear

* Ordering states by flood affected area

    gen mean_fafarea = .
    
    forvalues i = 1/21 {
     summ fafarea if state == `i'
     replace mean_fafarea = r(mean) if state == `i'
    }
    
    sort mean_fafarea year
    egen state_order = group(mean_fafarea statenm), label(statenm, replace)
    
* (1) Replace & with "and" in the state names.
replace statenm = subinstr(statenm, "&", "and", .)

* (2) Create a helper variable that picks the state name for each state_order group.
bysort state_order: gen label_name = statenm[1]

* Get the unique numeric levels of state_order.
levelsof state_order, local(levels)

local first = 1
foreach lvl of local levels {
    quietly levelsof label_name if state_order == `lvl', local(mylabel)
    display as text "For state_order value `lvl', mylabel: " `"`mylabel'"'
    if `first' == 1 {
         label define new_lab `lvl' `"`mylabel'"'
         local first 0
    }
    else {
         label define new_lab `lvl' `"`mylabel'"', modify
    }
}

* Attach the new value label to state_order.
label values state_order new_lab


*The plot itself

twoway ///
    (line fafarea year, sort) ///
    (line fafarea year, sort yaxis(2) lcolor(none)), ///
    ytitle("Flood Affected Area (millions hectares)", size(small)) ///
    ylabel(0(4)8, labsize(medsmall)) ymtick(##4) ///
    ylabel(0(4)8, labsize(medsmall) axis(2)) ymtick(##4, axis(2)) ///
    xtitle("") xlabel(1950(10)2000) xmtick(##2) ///
    by(state_order, note("Data Source: Besley & Burgess (2000)", margin(medium) size(small)) ///
    title("Flooding in Indian States, 1950-1996", margin(medium)) legend(off)) aspect(.45)

This Stata code reproduces the scatterplots used to illustrate logarithmic transformations

//  do.file to reproduce Ch. 8 scatterplots

//Upload and pre-process data
use "Kostelka & Blais Base.dta", clear
    
* I want only countries with long time series of 20+ legislative elections

* sort data by county, election type, year
    gsort country EL_TYPE year

*   drop presidential elections 
    drop if EL_TYPE==1

*compute N legislative elections per country    
    by country, sort: gen N_Elections = _N

*keep countries with 20+ elections  
        keep if N_Elections > 19
    
* Create some transformations for use later:
* Odds Ratio
gen PRturnout = (Turnout/100)
gen ORturnout = PRturnout/(1-PRturnout) 
label var ORturnout "Turnout (Odds Ratio)"  

* The logit or log odds of turnout (expressed as a proportion!). 
* Stata has a logit function!
gen logitTO = logit(PRturnout)
label var logitTO "Logit of Turnout"

* Log 10 electorate size
gen log10ELECTORATE = log10(electorate_size * 1000000)
label var log10ELECTORATE "Electorate Size (log base 10)"

*Electorate size in log base 2 -- must use base conversion rule
gen log2ELECTORATE = log10ELECTORATE/log10(2)
label var log2ELECTORATE "Electorate Size (log base 2)"

*You can express logits in base 2 too: 1 unit increase => doubling of log(2) odds. 
gen logit2TO =  logitTO/ln(2)
label var logit2TO "Logit (base 2) of Turnout"

// Panel A: Turnout by electorate size
* Even after rescaling electorate size, electorate size highly right skewed
* Many observations clustered and occluded in top-left of plot

scatter Turnout electorate_size, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) xtitle("Electorate Size (millions)", size(vsmall)) ytitle("Turnout (% reg. voters)", size(vsmall)) title("Panel A", size(small) position(11) margin(vsmall)) aspect(1) graphregion(margin(small)) plotregion(margin(small))

  *save graph as png and as .gph (needed for graph combine below).  Add path as necessary.
    graph export "scatter_transform_1.png", as(png) name("Graph") replace
    graph save "scatter_transform_1.gph", replace

// Panel B: Turnout by log10(electorate size)
*Rescaling electorate size via log10 transformation eases right skew

scatter Turnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) xtitle("log{sub:10} Electorate Size", size(vsmall)) ytitle("Turnout (% reg. voters)", size(vsmall)) title("Panel B", size(small) position(11) margin(vsmall)) aspect(1) xscale(r(5.5 8.5)) xlabel(5.5(.5)8.5, format(%4.1f)) graphregion(margin(small)) plotregion(margin(small))

    *save graph as png and as .gph
    graph export "scatter_transform_2.png", as(png) name("Graph") replace
    graph save "scatter_transform_2.gph", replace

// Panel C: Turnout/1-Turnout by log10(electorate size)
* Rescaling by turnout odds ratio is not an improvement
* The transformation "squashes" turnout and interpretation is opaque
    
twoway(scatter ORturnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1)  xtitle("log{sub:10} Electorate Size", size(vsmall)) ytitle("Turnout (Odds Ratio)", size(vsmall)) title("Panel C", size(small) position(11) margin(vsmall)) aspect(1) xscale(r(5.5 8.5)) xlabel(5.5(.5)8.5, format(%4.1f)) yscale(r(0 40)) ylabel(0(5)40, format(%4.1f)) graphregion(margin(small)) plotregion(margin(small))) 

    *save graph as .gph and png
    graph save "scatter_transform_3.gph", replace
    graph export "scatter_transform_3.png", as(png) name("Graph") replace


//Panel D: Adding Secondary Axes 
* This took a LOT of work; it's not intuitive & this code is well worth saving
* See https://www.stata-journal.com/sjpdf.html?articlenum=gr0032
* Must be run as a chunk!   

foreach n of numlist  27 50  82  95  99 {
    local labely `labely'  `= round(logit(`n'/100))' "`n'"
    }
  
foreach m of numlist 1 10 100  {
    local labelx `labelx'  `= round(log10(`m'*1000000))' "`m'"
  }
    
scatter logitTO log10ELECTORATE, ylabel(`labely') ytitle("Turnout (% reg. voters)", size(vsmall)) msymbol(circle) mfcolor(none) mlcolor(none)||scatter logitTO log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1) yaxis(2) ylabel(-1 0  1.5  3   4.5, axis(2)) ytitle("log{sub:10}{sup:Turnout}/{sub:1-Turnout}", size(vsmall) axis(2)) xaxis(1 2) xtitle("log{sub:10} Electorate Size", size(vsmall) axis(1)) xscale(r(5.5 8.5) axis(1)) xlabel(5.5 (.5) 8.5, format(%4.1f)) xlabel(`labelx', format(%4.2f) axis(2)) xtitle("Electorate Size (millions)", axis(2) size(vsmall)) legend(off) title("Panel D", size(small) position(11) margin(vsmall)) aspect(1) graphregion(margin(small)) plotregion(margin(small))

    *save graph as .gph and png
    graph save "scatter_transform_4.gph", replace
    graph export "scatter_transform_4.png", as(png) name("Graph") replace

//A panel of graphs: note that -graph combine- works on saved .gph files, not exported .png files
    graph combine   "scatter_transform_1.gph"   ///
                  "scatter_transform_2.gph" /// 
                            "scatter_transform_3.gph"   ///
                            "scatter_transform_4.gph", ///
                  cols(2) colfirst scale(.8) ///    
                  title("Turnout and Electorate Size", size(medsmall)) ///
                  note("Source: Kostelka & Blais (2021)" "Legislative elections in countries with 20+ elections in dataset", size(vsmall))

    *save graph as png
    graph export "table_transformed_scatters.png", as(png) name("Graph") replace
                     
                     
//Final Product
*This is the final product with nice labels and markers
*This combines the best aspects of Panel B and the secondary x-axis of Panel D
*It must be run as a chunk!     
  
foreach m of numlist 1 10 100  {
    local labelx `labelx'  `= round(log10(`m'*1000000))' "`m'"
  }

scatter Turnout log10ELECTORATE, ytitle("Turnout (% reg. voters)", size(small)) msymbol(circle) mfcolor(none) mlcolor(none) xaxis(1 2) xtitle("log{sub:10} Electorate Size", size(small) axis(1)) xscale(r(5.5 8.5) axis(1)) xlabel(5.5 (.5) 8.5, format(%4.1f)) xlabel(`labelx', format(%4.2f) axis(2)) xtitle("Electorate Size (millions)", axis(2) size(vsmall))||scatter Turnout log10ELECTORATE, msize(medsmall) msymbol(circle) mfcolor(gs12%40) mlcolor(gs4) mlwidth(vthin) jitter(1)  legend(off)   title("Turnout and Electorate Size", size(medsmall)) note("Source: Kostelka & Blais (2021)" "Legislative elections in countries with 20+ elections in dataset", size(vsmall)) aspect(1) 

    *save graph
    graph export "scatter_transform_5.png", as(png) name("Graph") replace
Implementing Lowess in Software

This code chunk shows how to employ Stata’s lpoly command to estimate local-polynomial (loess-style) curves.
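Before turning to the Stata syntax, it may help to see what a local-polynomial smoother actually computes. This sketch (plain Python, made-up data, purely illustrative) fits a weighted least-squares line within a window around an evaluation point, down-weighting distant observations with a tricube kernel; this is the degree-1 idea behind lpoly and loess.

```python
import math

def local_linear(x, y, x0, bandwidth):
    # weighted least-squares line around x0 (tricube kernel);
    # returns the smoothed value at x0
    pts = [(xi, yi) for xi, yi in zip(x, y) if abs(xi - x0) < bandwidth]
    w = [(1 - (abs(xi - x0) / bandwidth) ** 3) ** 3 for xi, _ in pts]
    sw = sum(w)
    xbar = sum(wi * xi for wi, (xi, _) in zip(w, pts)) / sw
    ybar = sum(wi * yi for wi, (_, yi) in zip(w, pts)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, (xi, _) in zip(w, pts))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, (xi, yi) in zip(w, pts))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)

# a noisy sine curve; the smoother should track sin(x)
xs = [i / 10 for i in range(50)]
ys = [math.sin(xi) + 0.1 * (-1) ** i for i, xi in enumerate(xs)]
print(local_linear(xs, ys, 2.0, 1.0))
```

Repeating the calculation at many evaluation points traces out the smooth curve that lpoly draws.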

// lpoly:  degree() = degree of the local polynomial; bwidth() = half-width of the smoothing window, in units of the x variable.

*Ex. 1: The degree of the polynomial matters less than you think

twoway (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (lpoly Turnout Majority  if country == 5, degree(1) bwidth(2)  lpattern(solid) lcolor(reddish)) (lpoly Turnout Majority  if country == 5, degree(3) bwidth(2) lpattern(longdash) lcolor(black)), xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Turnout (%)" 2 "Degree = 1" 3 "Degree = 3") span) note("Data source: Kostelka & Blais (2021)" "Bandwidth = 2", size(vsmall)) aspect(1)


*Ex. 2: Holding degree at 1, what is effect of bandwidth?

twoway (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (lpoly Turnout Majority  if country == 5, degree(1) bwidth(1.25)  lpattern(solid) lcolor(reddish)) (lpoly Turnout Majority  if country == 5, degree(1) bwidth(2.50) lpattern(longdash) lcolor(black)), xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Turnout (%)" 2 "Bandwidth = 1.25" 3 "Bandwidth = 2.50") span) note("Data source: Kostelka & Blais (2021)" "Degree = 1", size(vsmall)) aspect(1)

*Ex. 3:  Is a polynomial of degree 1, bandwidth 2.5 defensible?

*estimate the lowess curve and save the predicted values
quietly: lpoly Turnout Majority  if country == 5, degree(1) bwidth(2.50) gen(Predicted_Values) at(Majority)

*If y = f(x) + e, then e = y - f(x). Hence, we get residuals by subtracting predicted from observed turnout.
quietly: gen Residuals = Turnout - Predicted_Values


*This interim plot shows how to visualize the residuals

twoway  (lpoly Turnout Majority  if country == 5, degree(1) bwidth(2.50) lpattern(longdash) lcolor(sky) lwidth(thin)) (scatter Turnout Majority if country == 5, sort msize(medsmall) msymbol(circle) mfcolor(gs8%50) mlcolor(black) mlwidth(vthin) msize(medium)) (scatter Predicted_Values Majority if country == 5, sort msize(medsmall) msymbol(plus) mfcolor(black) mlcolor(black) mlwidth(vthin) msize(medlarge)) , xmtick(##10, labsize(small)) yscale(r(75 100)) ylabel(75(5)100) xtitle("Majority %") ytitle("Turnout %") title("Turnout at Austrian legislative elections, 1950-2021") legend(on order(1 "Lowess (deg: 1; b-width: 2.5)" 2 "Observed Turnout (%)" 3 "Predicted Turnout (%)") span) note("Data source: Kostelka & Blais (2021)", size(vsmall)) aspect(1)

*Ex 4:  Residuals vs Majority plot + lowess
*The lowess curve is pretty flat -- so that is OK
*But we have a clump of large residuals -- our model is missing something in the data!?

twoway  (lpoly Residuals Majority  if country == 5, degree(1) bwidth(2.5) lpattern(solid) lcolor(reddish) lwidth(medthin)) (scatter Residuals Majority  if country == 5, mstyle(circle) msize(medsmall)), yscale(r(-6 6)) ylabel(-6(2)6) yline(0, lcolor(gs8)) legend(off) xtitle("Majority (%)") ytitle("Lowess Residuals") title("Lowess residuals graphed" "against winning party's majority" "at Austrian legislative elections.") aspect(1)

*Check the Austria data! Nonparametric approaches do not make misspecification error disappear.   

twoway  (scatter Turnout Majority  if country == 5 & CV >0, msymbol(circle) msize(medium) mcolor(black)) (scatter Turnout Majority  if country == 5 & CV ==0, msymbol(circle) msize(medium) mcolor(gs10)), legend(on order(1 "Compulsory" 2 "Voluntary")) note("Data source: Kostelka & Blais (2021)" "Note: Majority status is absolute value of winning " "party's vote share - 50 percent", size(vsmall)) aspect(1)  ylabel(75(5)100) title("Turnout and majority status at" "Austrian elections, 1950-2021:") subtitle("Compulsory and voluntary voting.", size(small))
Q-Q Plots in Stata

This code reproduces the Q-Q plot in the text above.

//Load V-Dem data
use "V-Dem-CY-Full+Others-v14.dta", clear

//rescale V-Dem electoral democracy (polyarchy) index
gen PolyX10 = v2x_polyarchy * 10

//generate Q-Q plot

qqplot e_democ PolyX10, mfcolor(none) mlcolor(gs9) mlwidth(vthin) rlopts(lcolor(black) lwidth(vthin) lpattern(dash)) ytitle("Polity - Democracy Score" , size(small)) ylabel(0(2)10, labsize(vsmall)) ymtick(##2) xtitle("V-Dem - Electoral Democracy Index", size(small)) xlabel(0(2)10, labsize(vsmall)) xmtick(##2) title("Benchmarking Polity's Democracy Dimension" "vs V-Dem's Electoral Democracy Index", size(small)) aspect(1)
Log transformations and scatterplots in R

This code produces analogs to the Stata scatterplots of turnout and logged electorate size.

# Load necessary libraries
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(haven)

# Load data
Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

head(elections_data)

# Process data
# Filter out presidential elections (EL_TYPE ==1)
elections_data <- filter(elections_data, EL_TYPE==0)

# Compute N records per country, filter out those with <20
elections_data <- elections_data %>%
     group_by(country = as.character(country)) %>%
     mutate(NumRecords = n()) %>%
     ungroup() %>%
     filter(NumRecords >= 20)

# A basic scatterplot

ggplot(data = elections_data, aes(x=electorate_size, y=Turnout)) + 
  geom_point()

# Transforming a variable using mutate() 

elections_data <- elections_data %>%
  mutate(log10_electorate_size = log10(electorate_size*1000000))

# Repeat scatterplot with log10 electorate size

ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) + 
  geom_point()

# This computes log2 electorate size 
elections_data <- elections_data %>%
  mutate(log2_electorate_size = log2(electorate_size*1000000))

# Repeat scatterplot with log2 electorate size
ggplot(data = elections_data, aes(x=log2_electorate_size, y=Turnout)) + 
  geom_point()

## Making it pretty:  Let's improve the scatterplot's aesthetics by:
# - removing ink from the markers to reduce occlusion
# - converting to white background to improve contrast
# - labels to inform readers about what's plotted

## An  improved scatterplot
ggplot(data = elections_data, aes(x=log2_electorate_size, y=Turnout)) + 
  geom_point(shape=1, color="gray60", size=2) +
  labs(x=bquote("log"[2]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
  coord_fixed(ratio=.1) +
  theme_bw()

## Adding a 2nd x-axis
plot2axes <- ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) + 
  geom_point(shape=1, color="gray60", size=2) +
  labs(x=bquote("log"[10]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
   coord_fixed(ratio=.025) +
  theme_bw()

plot2axes + scale_x_continuous(
  name = bquote("log"[10]*" Electorate Size"),
  sec.axis = sec_axis(~10^.,
                       name = "Electorate size (in millions)",
                       breaks = c(1e6, 10e6, 100e6),  # Set breaks at 1, 10, and 100 million
                       labels = c("1", "10", "100"),  # Set labels as 1, 10, and 100 million
                       )
  )
Loess in R

This code replicates the loess plots above.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(haven)

#STAGE 1: LOAD & PREPROCESS DATA

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base
# head(elections_data)

#Filter out presidential elections (EL_TYPE ==1)
elections_data <- filter(elections_data, EL_TYPE==0)

#Compute N records per country, filter out those with <20
elections_data <- elections_data %>%
     group_by(country = as.character(country)) %>%
     mutate(NumRecords = n()) %>%
     ungroup() %>%
     filter(NumRecords >= 20)

#STAGE 2: FIT LOESS CURVE TO DATA

#Lowess regression - linear regression, bandwidth .4
elections_data <- elections_data %>%
  mutate(log10_electorate_size = log10(electorate_size*1000000))

ggplot(data = elections_data, aes(x=log10_electorate_size, y=Turnout)) + 
  geom_point(shape=1, color="gray60", size=2) +
  geom_smooth(formula = y ~ x, method = "loess", span = .4, se=TRUE, color="gray20") +  # loess curve with bandwidth .4
  labs(x=bquote("log"[10]*" Electorate Size"), y="Turnout (%)", title="Turnout & Electorate Size", subtitle="Legislative elections in 16 countries, 1946-2016", caption="Data Source: Kostelka & Blais (2021)") +
  coord_fixed(ratio=.025) +
  theme_bw()

#STAGE 3: DIAGNOSE LOESS BANDWIDTH

# Fit loess curve
fit <- loess(Turnout ~ log10_electorate_size, data = elections_data)

# Compute predicted values
predicted <- predict(fit)

# Compute residuals
residuals <- elections_data$Turnout - predicted

#merge above to elections_data
elections_data <- data.frame(elections_data, Predicted_TO = predicted, Residuals_TO = residuals)

#plot Residuals_TO vs log10_electorate_size (log10_electorate_size was computed above)

ggplot(data = elections_data, aes(x=log10_electorate_size, y=Residuals_TO)) + 
  geom_point(shape=1, color="gray60", size=2) +
  geom_smooth(formula = y ~ x,method = "loess", span = .4, se=FALSE) +  # loess curve with bandwidth .4, se=TRUE returns CIs
  labs(x=bquote("log"[10]*" Electorate Size"), y="Loess Residuals", title="Assessing propriety of bandwidth = .4", caption="Data Source: Kostelka & Blais (2021)") +
   coord_fixed(ratio=.025) +
  theme_bw()
Quantile-Quantile Plots in R

ggplot2 does not readily produce two-sample Q-Q plots, but base R’s qqplot() function does.
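Under the hood, a two-sample Q-Q plot simply pairs corresponding quantiles of the two samples; if the samples share a distribution, the pairs fall on the 45-degree line. A minimal sketch of that logic (Python, used only because the idea is language-neutral; the numbers are made up):

```python
def qq_pairs(a, b):
    # pair n evenly spaced quantiles of samples a and b
    a, b = sorted(a), sorted(b)
    n = min(len(a), len(b))

    def quantile(s, q):
        # linear interpolation between order statistics
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    qs = [(i + 0.5) / n for i in range(n)]
    return [(quantile(a, q), quantile(b, q)) for q in qs]

pairs = qq_pairs([1, 3, 5, 7], [2, 6, 10, 14])  # b = 2*a, so pairs lie on y = 2x
print(pairs)
```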

#load data

library(haven)
V_Dem_CY_Full_Others_v14 <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/V-DEM/V-Dem-CY-FullOthers-v14_dta/V-Dem-CY-Full+Others-v14.dta")

#adapt range of polyarchy variable 
library(dplyr)
V_Dem <- V_Dem_CY_Full_Others_v14 %>%
  mutate(PolyX10 = v2x_polyarchy *10)

#replace missing value codes on Polity democracy variable
V_Dem <- V_Dem %>%
  mutate(e_democ = ifelse(e_democ < 0, NA, e_democ))

#base R graphics work differently than ggplot:

#1. set a square graph region:
par(pty = "s")

#2. use base R's qqplot()
qqplot(
       V_Dem$PolyX10, V_Dem$e_democ,
       xlab = "V-Dem Electoral Democracy Index",      # X-axis label
       ylab = "Polity Democracy Score",               # Y-axis label
       pch = 21,                                      # Circle with fill
         col = "black",                                # Border (outline) color
         bg  = "gray90",                               # Fill color (light gray)
         lwd = 1,                                      # Outline width
       main = "Benchmarking Polity's Democracy Scale
            \nvs. V-Dem's Electoral Democracy Index", # Optional main title
       font.main = 1,                                 # Use regular font on title
       cex.main  = 0.8,                                # Smaller title font too
       asp = 1                                        # Aspect ratio of 1 works to place emphasis on 45-degree line
)

#3. add the diagonal reference line
abline(0, 1, col = "red", lty = 2, lwd = 1.5)

It was much easier to re-order the facets of the “better” trend-line plot in ggplot because of the fct_reorder() function from the forcats package. Getting the minimal theme to look the way I wanted took some extra code, but it was easy to add because of ggplot’s modularity.

# Load required libraries
library(ggplot2)
library(dplyr)
library(forcats)
library(haven)  # For importing Stata files

# Load the Stata data set
data <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Indian States Calamity Data/Calamity.dta")

# Step 1: Replace "&" with "and" in state names
data$statenm <- gsub("&", "and", data$statenm)

# Step 2: Calculate mean flood-affected area for each state
state_means <- data %>%
  group_by(statenm) %>%
  summarize(mean_fafarea = mean(fafarea, na.rm = TRUE)) %>%
  ungroup()

# Step 3: Merge mean flood area data back into the main data set
data <- data %>%
  left_join(state_means, by = "statenm") %>%
  mutate(state_order = fct_reorder(statenm, mean_fafarea))

# Step 4: Create faceted time series plot with fixed y-axis across all facets
ggplot(data, aes(x = year, y = fafarea)) +
  geom_line(linewidth=.7) +
  facet_wrap(~state_order, scales = "fixed", ncol = 4) +  # **Ensures uniform y-axes**
  labs(
    title = "Flooding in Indian States, 1950-1996",
    subtitle = "Data Source: Besley & Burgess (2000)",
    x = "",
    y = "Flood Affected Area (millions hectares)"
  ) +
  theme_minimal(base_size = 14) +  # Clean theme
  theme(
    strip.text = element_text(size = 8),  # Facet labels more readable
    axis.text.x = element_text(size = 9, angle = 45),
    axis.text.y = element_text(size = 9),
    
    # **Restore Tick Marks**
    axis.ticks = element_line(color = "black", linewidth = 0.5),
    
    # **Major grid lines only**
    panel.grid.major = element_line(color = "gray70", linewidth = 0.5),
    panel.grid.minor = element_blank(),  # Remove minor grid lines

    # **Simulate panel border**
    panel.background = element_rect(color = "black", fill = NA, linewidth = 0.5),
    
    # **Adjust aspect ratio for better horizontal comparisons**
    aspect.ratio = 1/3
  ) +
  
  # Ensure x-axis spans from 1950 to 1995 with major ticks at 10-year intervals
  scale_x_continuous(
    limits = c(1950, 1995),
    breaks = seq(1950, 1995, by = 10)  # Major ticks at 10-year intervals
  ) +
  
  # Ensure y-axis has major ticks at 3-unit intervals (no minor ticks)
  scale_y_continuous(
    breaks = seq(floor(min(data$fafarea, na.rm = TRUE)), ceiling(max(data$fafarea, na.rm = TRUE)), by = 3)
  )




Chapter 9: Conditional Relationships

Introduction

In exploring complex data, we often encounter situations where the relationship between two variables, \(y\) and \(x\), is influenced by a third variable, \(z\). Understanding these conditional relationships is critical for identifying interactive effects and hidden patterns. We can construct three-way tables to examine these relationships, but such tables quickly become cumbersome. An effective alternative way to investigate such relationships is through co-plots, or conditioning plots.

Co-plots

Co-plots display the joint distribution of \(y\) and \(x\) while conditioning on \(z\). These visualizations help assess how the relationship between \(y\) and \(x\) changes based on different values or categories of \(z\). For example:

  • \(y\) is positively correlated with \(x\) when \(z \geq 0\)

  • \(y\) is negatively correlated with \(x\) when \(z < 0\)
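A tiny simulation makes this sign-flipping concrete. The sketch below (Python, entirely synthetic data) builds y = x + noise when z is non-negative and y = -x + noise otherwise; conditioning on z recovers the two opposite relationships that pooling would hide.

```python
import random

def corr(a, b):
    # Pearson correlation of two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
data = []
for _ in range(400):
    x = random.uniform(-1, 1)
    z = random.uniform(-1, 1)
    slope = 1 if z >= 0 else -1          # z flips the y-x relationship
    data.append((x, slope * x + random.gauss(0, 0.2), z))

xs_pos, ys_pos = zip(*[(x, y) for x, y, z in data if z >= 0])
xs_neg, ys_neg = zip(*[(x, y) for x, y, z in data if z < 0])
print(corr(xs_pos, ys_pos), corr(xs_neg, ys_neg))
```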

Main Methods of Conditioning

There are several techniques available to condition a bivariate relationship on a third variable, each with its own advantages:

  1. Overlaying: Use different markers or colors to represent varying values of \(z\) within the same plot. This method is useful when \(z\) has a limited number of distinct categories.

  2. Facets: Create separate panels or subplots for different values or ranges of \(z\) . Faceting is especially useful for identifying patterns when \(z\) is categorical or can be grouped into meaningful intervals.

  3. 3-D Plots: Display the relationship between \(y\), \(x\) , and \(z\) in three dimensions. 3-D plots can be susceptible to distortion due to perspective issues, making them harder to interpret in some cases.

  4. Plot Matrices: Construct matrices of plots where different combinations of variables (e.g., \(x\) , \(y\), and \(z\) ) are displayed in a grid. Plot matrices can reveal pairwise relationships and help assess interactions systematically.

  5. Contour Plots: Represent the relationship between \(y\) and \(x\) using contour lines to indicate levels of \(z\). Contour plots are particularly effective for continuous variables but may require computational effort.

  6. Heat Maps: Visualize the relationship between \(y\) and \(x\) using color intensity or shading to represent varying levels of \(z\) . Heat maps provide a compact summary of complex 3-way data, but they require careful selection of color scales to avoid ambiguity and misinterpretation.

General Considerations and Notes

  • Is the conditioning variable, \(z\), categorical or continuous? These methods are generally more straightforward when \(z\) is categorical. For continuous \(z\) , dividing it into intervals or using smoothed estimates may improve interpretability.

  • Computational Intensity: Techniques like contour plots and heat maps can be computationally expensive when all variables are continuous, particularly with large data sets. My experience is that R’s ggplot copes with these computational demands better than Stata.

  • Distortion Issues: Some methods, especially 3-D plots, can be misleading due to perspective distortions or lack of depth perception.

  • Pairing with Scatterplots: These conditioning techniques pair well with scatterplots but can also complement other visual tools like quantile-quantile plots for deeper insights.

By understanding and applying these methods, analysts can effectively uncover and interpret complex conditional relationships within their data, leading to more robust analyses and actionable conclusions.
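When \(z\) is continuous, the interval-division mentioned above is often the first practical step: cut \(z\) into (say) terciles and then overlay or facet on the resulting groups. A minimal sketch of that discretization (Python, hypothetical values; the labels are illustrative):

```python
def tercile(z, zs):
    # assign z to "low"/"mid"/"high" using the terciles of the sample zs
    s = sorted(zs)
    cut1, cut2 = s[len(s) // 3], s[2 * len(s) // 3]
    if z <= cut1:
        return "low"
    return "mid" if z <= cut2 else "high"

zs = [4, 9, 1, 7, 3, 8, 2, 6, 5]
labels = [tercile(z, zs) for z in zs]
print(labels)
```

In Stata or R the same step would be done with egen cut()/xtile or dplyr's ntile()/cut(), after which the group labels drive the overlay markers or facet panels.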

Conditioning Plot Examples

We have already used facets several times in Chapters 6 and 8; here, I extend the examples of conditioning plots to (1) overlaid scatterplots, (2) plot matrices, and (3) heat maps.

Example 1: Overlaid Plot

The overlay technique works well in this plot because the average levels of turnout at elections with and without compulsory voting are quite distinct, and this minimizes occlusion. The result is a plot that shows a clear drop in both turnout and competitiveness after Austria abandoned compulsory voting.

Example 2: Overlaid Plot with Facets

We should not always think of overlays and facets as substitutes. The plot below uses both techniques to show the effects of institutional change in voting laws on turnout and majority size in Austria and the Netherlands. It is worth reflecting on how economically this plot shows variation on four variables: turnout, majority size, compulsory voting, and country.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(dplyr)
library(haven)

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

## Now with facets too:

elections_data$CV_alt <- ifelse(elections_data$CV > 0, 1, 0)
elections_data_filtered <- filter(elections_data, country == 5|country==78)

#create country name labels
elections_data_filtered <- elections_data_filtered %>%
  mutate(
    country = as.character(country),  # Convert to character if needed
    country_name = recode(country, "5" = "Austria", "78" = "Netherlands")  # Recoding with quotes
  )

ggplot(data = elections_data_filtered, aes(x = Majority, y = Turnout, fill = factor(CV_alt))) + 
  geom_point(shape = 21, size = 3, color = "black", stroke = 0.8) +  # Shape 21 allows different fill and border
   scale_fill_manual(
    values = c("white", "gray30"),  # First category: white fill, second category: black fill
    labels = c("Voluntary voting", "Compulsory voting")
  ) +
  facet_wrap(~country_name, scales = "fixed") +
  labs(
    x = "Majority (%)", 
    y = "Turnout (%)", 
    fill = "",  # Legend title removed
    title = "Turnout and majority size and compulsory voting",
    subtitle = "Legislative elections in Austria and the Netherlands, 1950-2015", 
    caption = "Data Source: Kostelka & Blais (2021)"
  ) +
  coord_fixed(ratio = 0.5) +
  theme_classic() +
  theme(legend.position = "bottom")

Highlighting Outliers

Color, in my view, should be used sparingly; it is often unnecessary and it can be distracting. Nonetheless, color can be used effectively to draw the reader’s attention to specific aspects of the data. This plot uses a third layer to add color to two outliers. With a little more work, I could probably have made the outlier in the top-right quadrant solid red–and that would be an improvement.

Heatmap

Overlaying and faceting work well when the conditioning variable is discrete and coarse. However, when the conditioning variable is fine-grained or continuous, other techniques are necessary. One option is contour maps; another is heat maps, which use intensity to illustrate how a third variable changes in response to the interaction between two others.

Heat maps present a trade-off similar to that of histograms: we must bin each continuous variable. More bins provide greater detail but risk over-fitting, while fewer bins reduce noise but may obscure meaningful patterns in the data.
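The binning step itself is simple bookkeeping: assign each observation to an (x, y) cell and summarize the third variable within the cell. A minimal sketch (Python, made-up rows; the bin width of 10 is arbitrary):

```python
from collections import defaultdict

def heatmap_bins(rows, width=10):
    # mean of z within square (x, y) cells of the given width
    cells = defaultdict(list)
    for x, y, z in rows:
        cells[(int(x // width), int(y // width))].append(z)
    return {cell: sum(zs) / len(zs) for cell, zs in cells.items()}

rows = [(12, 81, 0.2), (17, 85, 0.4), (33, 91, 0.9)]
cells = heatmap_bins(rows)
print(cells)   # the first two rows share cell (1, 8); their z values average
```

Widening the cells pools more observations per cell (less noise, less detail); narrowing them does the reverse, which is exactly the histogram-like trade-off described above.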

Scatterplot Matrix

A scatterplot matrix does what it says: it presents a matrix of scatterplots.

|           | x (col 1)                      | y (col 2)                      | z (col 3)                      |
|-----------|--------------------------------|--------------------------------|--------------------------------|
| x (row 1) | Diagonal (distribution of x)   | Upper (relationship of x to y) | Upper (relationship of x to z) |
| y (row 2) | Lower (relationship of y to x) | Diagonal (distribution of y)   | Upper (relationship of y to z) |
| z (row 3) | Lower (relationship of z to x) | Lower (relationship of z to y) | Diagonal (distribution of z)   |

You can think of a scatterplot matrix as a structured table of plots organized into three regions: a diagonal region and two off-diagonal regions (upper and lower). The off-diagonal regions are typically used to represent joint distributions—that is, pairwise relationships between the variables. However, there’s no necessity for the upper and lower regions to mirror each other: one might display numeric summaries (like correlations) while the other shows scatterplots. The diagonal cells, in contrast, typically show marginal distributions—representing each individual variable—but could alternatively be used simply to label variables clearly.

The markdown table above illustrates these ideas concretely. For instance, the diagonal cells (x–x, y–y, and z–z) show marginal distributions. Meanwhile, the lower-left cell (z–x) exemplifies an off-diagonal plot, depicting the joint distribution—the relationship—between variables z and x.
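These cell assignments can be generated mechanically. A small sketch (Python; purely illustrative) maps each (row, column) pair of variables to its region of the matrix:

```python
def matrix_cells(varnames):
    # classify each (row, col) cell of a scatterplot matrix by region
    cells = {}
    for i, r in enumerate(varnames):
        for j, c in enumerate(varnames):
            if i == j:
                cells[(r, c)] = f"diagonal: distribution of {r}"
            elif i < j:
                cells[(r, c)] = f"upper: {r} vs {c}"
            else:
                cells[(r, c)] = f"lower: {r} vs {c}"
    return cells

cells = matrix_cells(["x", "y", "z"])
print(cells[("y", "y")])   # diagonal: distribution of y
print(cells[("z", "x")])   # lower: z vs x
```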

Here is a series of scatterplot matrices using the underlying data from the UNDP’s Gender Inequality Index. One natural use of a scatterplot matrix is to visualize how the constituent indicators of an index are related to one another.

We can immediately appreciate the negative relationship between maternal mortality rates and the percentage of females receiving secondary-level education. By contrast, neither of those two variables is correlated with female labour force participation rates.






I am not convinced that this is the best iteration of these scatterplot matrices. However, it is impressive how much R lets one customize these plots!




Appendix: R Code for Conditioning Plots

This appendix presents the R code used to generate the plots in this chapter:

Overlaid Scatterplot

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(haven)

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

elections_data$CV_alt <- ifelse(elections_data$CV > 0, 1, 0)
elections_data_filtered <- filter(elections_data, country == 5)

ggplot(data = elections_data_filtered, aes(x = Majority, y = Turnout, fill = factor(CV_alt))) + 
  geom_point(shape = 21, size = 3, color = "black", stroke = 0.8) +  # Shape 21 allows different fill and border
   scale_fill_manual(
    values = c("white", "gray30"),  # First category: white fill, second category: black fill
    labels = c("Voluntary voting", "Compulsory voting")
  ) +
  labs(
    x = "Majority (%)", 
    y = "Turnout (%)", 
    fill = "",  # Legend title removed
    title = "Turnout and majority size",
    subtitle = "Austrian legislative elections, 1950-2015", 
    caption = "Data Source: Kostelka & Blais (2021)"
  ) +
  coord_fixed(ratio = 0.5) +
  theme_classic() +
  theme(legend.position = "bottom")

Overlaid Scatterplot with Facets

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(dplyr)
library(haven)

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

## Now with facets too:

elections_data$CV_alt <- ifelse(elections_data$CV > 0, 1, 0)
elections_data_filtered <- filter(elections_data, country == 5|country==78)

#create country name labels
elections_data_filtered <- elections_data_filtered %>%
  mutate(
    country = as.character(country),  # Convert to character if needed
    country_name = recode(country, "5" = "Austria", "78" = "Netherlands")  # Recoding with quotes
  )

ggplot(data = elections_data_filtered, aes(x = Majority, y = Turnout, fill = factor(CV_alt))) + 
  geom_point(shape = 21, size = 3, color = "black", stroke = 0.8) +  # Shape 21 allows different fill and border
   scale_fill_manual(
    values = c("white", "gray30"),  # First category: white fill, second category: black fill
    labels = c("Voluntary voting", "Compulsory voting")
  ) +
  facet_wrap(~country_name, scales = "fixed") +
  labs(
    x = "Majority (%)", 
    y = "Turnout (%)", 
    fill = "",  # Legend title removed
    title = "Turnout and majority size and compulsory voting",
    subtitle = "Legislative elections in Austria and the Netherlands, 1950-2015", 
    caption = "Data Source: Kostelka & Blais (2021)"
  ) +
  coord_fixed(ratio = 0.5) +
  theme_classic() +
  theme(legend.position = "bottom")
Highlighting Outliers in an Overlaid Scatterplot
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(dplyr)
library(haven)
library(ggplot2)
library(dplyr)

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

## Now with facets too:

elections_data$CV_alt <- ifelse(elections_data$CV > 0, 1, 0)
elections_data_filtered <- filter(elections_data, country == 5|country==78)


# Ensure 'country' is character and recode to proper names
elections_data_filtered <- elections_data_filtered %>%
  mutate(
    country = as.character(country),
    country_name = recode(country, "5" = "Austria", "78" = "Netherlands"),
    
    # Flag large majority elections in Austria
    highlight = ifelse(country_name == "Austria" & Majority >= 27, "Highlight", "Normal")
  )

# Create the scatter plot with highlighted points
ggplot(data = elections_data_filtered, aes(x = Majority, y = Turnout, fill = factor(CV_alt))) + 
  geom_point(aes(color = highlight), shape = 21, size = 3, stroke = 0.8) +  # Color applied to border
  scale_fill_manual(
    values = c("white", "gray30"),  # Voluntary = white, Compulsory = gray
    labels = c("Voluntary voting", "Compulsory voting")
  ) +
  scale_color_manual(values = c("Highlight" = "red", "Normal" = "black"), guide = "none") +  # Red for special points
  facet_wrap(~country_name, scales = "fixed") +  # Use proper country names as facet labels
  labs(
    x = "Majority (%)", 
    y = "Turnout (%)", 
    fill = "",  # Legend title removed
    title = "Turnout and Majority Size by Compulsory Voting",
    subtitle = "Legislative elections in Austria and the Netherlands, 1950-2015", 
    caption = "Data Source: Kostelka & Blais (2021)"
  ) +
  coord_fixed(ratio = 0.5) +
  theme_classic() +
  theme(
    legend.position = "bottom",
    strip.text = element_text(size = 10)  # Make facet labels slightly larger
  )

Heatmap

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(janitor)
library(dplyr)
library(haven)
library(ggplot2)
library(dplyr)

Kostelka_Blais_Base <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kostelka & Blais - Turnout/Kostelka & Blais Base.dta")
elections_data<-Kostelka_Blais_Base

# Remove NAs from Majority before binning
elections_data_clean <- elections_data %>%
  mutate(
    Majority = round(Majority),
    Majority_bin = cut(Majority, 
                       breaks = seq(min(Majority, na.rm = TRUE), max(Majority, na.rm = TRUE), by = 5), 
                       include.lowest = TRUE, right = FALSE),  # Ensures all values are binned properly
                       
    Turnout_bin = cut(Turnout, breaks = seq(20, 100, by = 10), include.lowest = TRUE, right = FALSE)
  )

# Create heatmap
ggplot(data = elections_data_clean, aes(x = Majority_bin, y = Turnout_bin, fill = Closeness)) + 
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  labs(
    x = "Majority (%)",
    y = "Turnout (%)",
    fill = "Closeness",
    title = "Heatmap of Turnout vs Majority",
    subtitle = "Binned values for better visualization",
    caption = "Data Source: Kostelka & Blais (2021)"
  ) +
  theme_minimal()

Scatterplot Matrices

In R, you will need to install and load the GGally package to take advantage of its ggpairs() function. That function allows you to produce and customize scatterplot matrices to a much greater extent (and with greater ease) than Stata’s graph matrix command.

library(tidyr)
library(GGally)
library(dplyr)
library(haven)

UNDP_GII <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/UNDP Gender Inequality/UNDP GII.dta")

#Select subset of data -- all continuous variables, note!
dataEG <-UNDP_GII %>%  
  select(maternal_mortality, female_secondary, female_lfp)

#ggpairs function produces a customizable scatterplot matrix
ggpairs(dataEG) + 
  ggtitle("A Default Plot Matrix from `ggpairs`")

#Iteration 1: note ggpairs code calls three regions of a plot matrix
ggpairs(dataEG,
        upper = list(continuous = wrap("cor", size=4)),   #Specifies upper-triangle plots
        lower = list(continuous = wrap("points", alpha=0.6, size=1.5)),  #Specifies lower-triangle plots
        diag = list(continuous = wrap("densityDiag" ))) +  #Specifies diagonal plots
  theme_bw()  + ggtitle("Altering the theme of plot matrix from `ggpairs` is easy!")


#Iteration 2: nice labels, less clutter
library(stringr)

# Original intuitive labels
original_labels <- c("Maternal Mortality per 100K Births",
                     "% Female Secondary Educated",
                     "% Female Labor Force Participation")

# Assign simple labels first (for data)
colnames(dataEG) <- c("mat_mortality", "female_secondary", "female_lfp")

# Create named vector for wrapping
wrapped_labels <- setNames(str_wrap(original_labels, width = 16),
                           c("mat_mortality", "female_secondary", "female_lfp"))

# Now plot using labeller to wrap labels
ggpairs(dataEG,
        upper = list(continuous = wrap("cor", size = 4, color="black")),
        lower = list(continuous = wrap("points", shape = 1, alpha = 1, size = 2, color = "gray20")),
        diag = list(continuous = wrap("densityDiag")),
        labeller = as_labeller(wrapped_labels)) +
  theme_classic(base_size = 10) + ggtitle("Using `stringr` and `labeller` to make nice labels")


#Iteration 3: regression lines and regression estimates in upper
#  Not sure this is best plot -- but the trick of computing a custom function
#  and then using ggpair to plot the function is noteworthy.

#Use ggplot to create a scatterplot w overlaid regression line
my_lower_fn <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) +
    geom_point(shape = 1, size = 2, color = "gray30") +
    geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "#CC6677")
}

#Estimate and save regression results
lm_eqn <- function(data, mapping, ...){
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  m <- lm(y ~ x, data = data)
  
  intercept <- format(coef(m)[1], digits = 3)
  slope <- format(coef(m)[2], digits = 3)

  eq <- paste0("italic(y) == ", intercept, " + ", slope, " %.% italic(x)")

  ggplot(data = data, mapping = mapping) +
    annotate("text", 
             x = mean(range(x, na.rm = TRUE)),
             y = mean(range(y, na.rm = TRUE)),
             label = eq,
             parse = TRUE,
             size = 3.5,
             color = "black") +
    theme_void()
}


ggpairs(dataEG,
        upper = list(continuous = lm_eqn),
        lower = list(continuous = my_lower_fn), # plots custom function defined above
        diag = list(continuous = wrap("densityDiag")),
        labeller = as_labeller(wrapped_labels)) +
  theme_classic(base_size = 10) + ggtitle("Superimposing regression lines!")




Chapter 10: Presenting Regression Results

Introduction

Linear regression remains a staple of empirical political science research. This chapter offers advice on how to present the results of a regression. I begin with advice on the construction and presentation of regression tables, including how to export them efficiently and quickly from Stata and R. The tabular presentation of regression results remains imperative in many situations because that format conveys a degree of precision that is essential to replication and not always possible via graphical methods. Nonetheless, graphical representations of regression results have become increasingly popular because they can provide more intuitive insights into relationships between variables than standard tables of coefficients. By visualizing key aspects like predicted values, marginal effects, and interaction effects, we can better understand how variables interact and how outcomes vary.

Regression: What’s Relevant and What’s Not.

Anytime we run a regression in Stata or R, the software produces a variety of detailed output. Which elements of this output are critical to include in a table of results, and which can be omitted?

  • Our primary concern is the regression coefficients—the marginal effect of \(x\) on \(y\). Present unstandardized coefficients unless the independent variable, \(x\), is an index or something similar with no “natural” units. This is because an unstandardized coefficient represents the unit change in \(y\) given a one unit increase in \(x\). By contrast, a standardized coefficient represents the standard deviation change in \(y\) given a one standard deviation change in \(x\).

  • We also care about how precisely these marginal effects are measured, making either the associated standard errors or confidence intervals relevant.

  • The interpretation of a coefficient’s size relative to its standard error depends on sample size. For instance, with a sample of 1,000,000 observations, nearly every coefficient is likely to have a p-value below 0.05 and thus be deemed “statistically significant.” This illustrates the potential for misleading conclusions when presenting p-values without reference to sample size. Thus, any table of regression results should show the sample size.

  • The \(R^2\) statistic is typically of lesser concern. It is sample-specific and mechanically increases as more independent variables are added. For this reason, statistics such as the adjusted-\(R^2\) or the Akaike Information Criterion (AIC) tend to be used to convey a regression model’s fit to the data. However, it’s not uncommon to see such measures go unreported.

  • Even in instrumental variables regression, where we seek a strong predictor for the endogenous variable, the focus should be on the first-stage F-statistic rather than its \(R^2\).

  • Similarly, unless it’s critical to show the reader that clusters of variables are jointly (in)significant, there’s little need to report the model or residual sum of squares.

None of the above is original advice; it largely repeats King’s (1989) recommendations.
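The relationship between the two kinds of coefficients noted above is easy to verify. A minimal sketch in base R, using simulated data (all names and values here are illustrative):

```r
# Sketch: a standardized coefficient equals the unstandardized
# coefficient rescaled by sd(x)/sd(y). Simulated data, base R only.
set.seed(381)
x <- rnorm(200, mean = 50, sd = 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 5)

b_raw <- coef(lm(y ~ x))["x"]              # unstandardized slope
b_std <- coef(lm(scale(y) ~ scale(x)))[2]  # standardized slope

# The two are linked exactly by the ratio of standard deviations:
all.equal(unname(b_std), unname(b_raw * sd(x) / sd(y)))  # TRUE
```

Because one coefficient can always be recovered from the other given the sample standard deviations, reporting the unstandardized version loses nothing while preserving the natural units of \(x\) and \(y\).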

Presenting Regression Results Effectively: A Case Study in Table Design

Let’s examine various means of reporting these aspects of a regression. To make things concrete, we use the data from Kam (2017) on bribe prices paid to voters by candidates at parliamentary elections in Victorian Britain. We estimate the following model:


\[ BRIBE_{it} = \beta_1 ELECTORS_{it} + \beta_2 SECRET_{t} + \beta_3 (SECRET \times ELECTORS_{it}) + \Sigma\alpha_i + \epsilon_{it}, \]

where:

\(BRIBE_{it}\) is the average bribe price paid by candidates in district \(i\) at time \(t\)

\(ELECTORS_{it}\) is the number of electors (in thousands) in district \(i\) at time \(t\)

\(SECRET_{t}\) indicates that the election was held after the secret ballot was adopted in 1872

\(SECRET \times ELECTORS_{it}\) is an interaction between \(SECRET_{t}\) and \(ELECTORS_{it}\)

\(\Sigma\alpha_i\) is a vector of dummy variables for districts \(i, j, \ldots, N\), and

\(\epsilon_{it}\) is a residual.


Our aim is to produce publication-ready tables and graphs that convey various aspects of the regression results, and to do so expeditiously.

I generated the table below using Stata’s esttab command. (esttab is a user-supplied ado file, which you can install via ssc install esttab; ado files are to Stata what libraries are to R.) Stata then produced the table in html format.

Table: Regression Estimates of Bribe Prices at Parliamentary Elections: Before and After the Secret Ballot.

                                     Bribe Price (£)   Bribe Price (£)   ln(Bribe Price (£))

N Electors (1000s)                       -0.79
                                         (0.53)
Secret Ballot x N Electors (1000s)        1.60***
                                         (0.60)
√(N Electors)                                              -0.22***
                                                           (0.08)
Secret Ballot x √(N Electors)                               0.30***
                                                           (0.09)
ln(N Electors)                                                               -0.70***
                                                                             (0.17)
Secret Ballot x ln(N Electors)                                                0.61***
                                                                             (0.21)
Secret Ballot                           -16.14***         -25.18***          -5.53***
                                         (4.55)            (7.03)            (1.78)

Adjusted R-Squared                        0.39              0.40              0.51
N Observations                            515               515               515
N Districts                               98                98                98

** p < 0.05, *** p < 0.01
Cell entries are unstandardized regression coefficients, with standard errors in parentheses below.
Column headings identify the dependent variable.
Bribe prices are standardized in terms of 1906 average earnings.
All models include a full set of district dummy variables.
Secret Ballot is coded 0 for years prior to 1872, and 1 thereafter.
Source: Kam (2017)


The Stata and R code used to reproduce this table is provided in the chapter’s appendix. However, that code is of secondary concern here; the primary focus is on the principles that guide the construction and presentation of regression tables.

While graphs can enhance interpretation, tables of regression coefficients remain indispensable. Academic journals may welcome graphical illustrations of regression effects, but most still require detailed regression tables in an online appendix. This is because tables are better suited to conveying the numerical precision that is necessary for replication.

Consider the table above. Just because it presents coefficients and standard errors does not mean we should abandon the principles of good table design discussed in Chapter 5:

  • Title clarity: The table has a clear and informative title that immediately conveys key aspects of the analysis. It identifies the table as presenting regression estimates, with bribe prices as the dependent variable. The title also establishes the scope and nature of the data, covering parliamentary elections from 1820 to 1906, and highlights the institutional shift under examination—the impact of the secret ballot by comparing bribe prices before and after its adoption.

  • Central focus: The coefficients and standard errors are the core elements of the table. They are presented with appropriate rounding to reflect the granularity of the underlying data, avoiding the illusion of excessive precision.

  • Significance notation: Coefficients are marked with significance stars, and a footnote explains their meaning. However, as discussed earlier, significance stars are increasingly falling out of favor. An alternative approach would be to present 95% confidence intervals, which offer a clearer sense of precision and uncertainty.

  • Summary statistics: The table includes only essential summary statistics. The inclusion of the number of districts is particularly valuable, as it allows readers to assess the density of data on which the “within-district” effects are estimated. Given approximately five observations per district (N obs / N districts ≈ 5), this detail provides important context for interpreting the results.

  • Omitting the intercept: The intercept and its standard error are omitted. In any fixed-effects model, the intercept estimate depends on the reference category (here, one particular electoral district), making it largely uninformative.

  • Variable labels: Labels are concise yet intuitively meaningful. One could argue for using the exact variable names from the regression equation, but the labels here strike a balance between clarity and brevity.

  • Table notes: Finally, a series of footnotes clarify key aspects of the table, including:

    • The nature of the cell entries
    • The dependent variable in each specification
    • The significance levels
    • The standardization of bribe prices and its basis
    • The inclusion of fixed effects
  • Minimalist design: Just as in graphical presentation, tables should avoid unnecessary clutter. The clean and simple layout ensures that the reader’s focus remains on the coefficients and standard errors.

By following these principles, we ensure that regression tables remain readable and informative.

Graphical Representations of Regression Results

Let’s now consider how we can use graphics to present the same regression results. We generally want to convey to the reader some or all of three basic quantities: coefficients (and their associated standard errors); predicted values; and marginal effects. We construct specific graphics to present each of these three quantities:

1. Coefficient Plots

A coefficient plot shows each estimated coefficient (and its confidence interval) on a single chart. This approach allows for quick visual comparisons of:

  • Sign: whether a coefficient is positive or negative
  • Magnitude: how large one coefficient is relative to another
  • Precision: how wide or narrow the confidence intervals are

Potential Pitfall: If your model has one or more coefficients that differ greatly in magnitude from the others, the y-axis scale can become distorted—making smaller coefficients appear insignificant when in fact they may not be.

One way to address this problem is to re-scale variables. However, as our example shows, simply shifting from “units” to “thousands” or taking logs (as we did with \(ELECTORS_{it}\)) might not always fix the problem. In such cases, a simple table of coefficients and standard errors may be the better way to communicate regression results.
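Re-scaling works mechanically: dividing a predictor by a constant multiplies its coefficient by that constant, leaving the fit and t-statistics unchanged. A quick base-R sketch with simulated data (the names and values are illustrative):

```r
# Sketch: rescaling a predictor rescales its coefficient by the same
# factor -- here, "electors" in raw counts versus thousands.
set.seed(42)
x <- runif(300, 0, 100000)                  # e.g., electors in raw counts
y <- 2 + 0.0005 * x + rnorm(300)

b_units     <- coef(lm(y ~ x))["x"]         # slope per single elector
b_thousands <- coef(lm(y ~ I(x / 1000)))[2] # slope per 1,000 electors

all.equal(unname(b_thousands), unname(b_units * 1000))  # TRUE
```

The coefficient grows by a factor of 1,000, but its t-statistic (and hence its significance) is identical, which is why rescaling is a purely cosmetic fix for coefficient-plot scale problems.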

2. Predicted Values Plots

A predicted values plot shows the fitted (expected) value of the outcome variable as one or more predictors vary, often holding other variables at typical (mean or median) values. This approach is particularly helpful for:

  • Showing the difference in outcomes under distinct conditions (e.g., pre- vs. post- policy change)
  • Illustrating non-linear relationships when using polynomials or transformations (e.g., square root, log)

Because predicted values plots necessarily fix other variables to specific values, they can become cumbersome in high-dimensional models or when there are many interaction terms. Still, for focusing on one or two key predictors, these plots can communicate the regression’s substantive implications more intuitively than a raw table or coefficient plot.
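The logic behind such plots can be sketched in base R with predict() over a grid of predictor values; the ggeffects package used later in this chapter automates exactly this. Simulated data; all names here are illustrative:

```r
# Sketch: fitted values from an interaction model, evaluated over a
# grid -- one predicted line per regime of the binary variable z.
set.seed(7)
d <- data.frame(x = runif(200, 0, 100), z = rbinom(200, 1, 0.5))
d$y <- 5 + 0.3 * d$x - 2 * d$z + 0.2 * d$x * d$z + rnorm(200)

m <- lm(y ~ x * z, data = d)

grid <- expand.grid(x = seq(0, 100, by = 10), z = c(0, 1))
grid$yhat <- predict(m, newdata = grid)  # predicted y for each (x, z) pair
```

Plotting grid$yhat against grid$x, with one line per value of z, yields exactly the kind of two-regime predicted values plot described above.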

3. Marginal Effects Plots

A marginal effects plot shows how the slope (or partial derivative) of the outcome with respect to a given predictor changes, typically over the range of that predictor itself (or over another conditioning variable). Mathematically, for a model \(y = f(x)\), the marginal effect of \(x\) on \(y\) is \(\frac{\partial y}{\partial x}.\)

  • In a simple linear regression with no interactions, the marginal effect of \(x\) on \(y\) is just the coefficient on \(x\). In such cases, the marginal effect is constant and a marginal effects plot is superfluous.
  • In models with polynomial terms or interactions, however, the marginal effect of \(x\) on \(y\) varies across values of \(x\) (and/or across levels of the interacting variable). A marginal effects plot is especially useful in such situations because it demonstrates how the effect of \(x\) changes depending on, say, a second variable or different ranges of \(x\).

In short, coefficient plots let you quickly see and compare the coefficient estimates (but can be distorted if magnitudes differ greatly), predicted values plots reveal the expected level of the outcome across changing predictors, and marginal effects plots highlight how the slope of the outcome changes when predictors or interacting factors vary. Each method can shed light on different aspects of a model’s implications.

Applying These Points to the Example

Let’s consider these three possibilities in light of the regression model and estimates of bribe prices at Victorian-era parliamentary elections. We will discuss whether each plot “works” for our example.

1. Coefficient Plots in the Example

The coefficient on \(ELECTORS\) is relatively small compared to the coefficient on \(SECRET\). Even though we have measured \(ELECTORS\) in thousands–effectively multiplying its coefficient by 1,000–it is still dwarfed by the coefficient on \(SECRET\). If we place them on a coefficient plot with a single axis, as is done below, it becomes difficult to see the coefficients on \(ELECTORS\) and \(ELECTORS \times SECRET\) because \(SECRET\) dominates the scale. This is exactly the sort of situation in which a numeric table of coefficients communicates the results more clearly and precisely.

2. Predicted Values Plots in the Example:

If you wanted to see how bribe prices shift as the electorate size changes, before and after the secret ballot, you could plot:

  • One line for the predicted bribe price when \(SECRET = 0\) (pre-policy),

  • Another line for \(SECRET = 1\) (post-policy),

while varying \(ELECTORS\) along the x-axis.

This shows how the levels of bribe prices differ between the two regimes across electorate sizes. Given the simple nature of the regression model in this example, this is a good choice in this instance:


Occlusion is a common problem in predicted values (or marginal effects) plots with multiple functions. That is the case here, with overlapping confidence intervals producing significant occlusion. Translucent shading of the confidence bands reduces the severity of the problem but does not eliminate it.

Another limitation of the previous plot is the x-axis, which uses \(\sqrt{\text{ELECTORS}}\) instead of \(\text{ELECTORS}\). This forces the reader to convert values mentally and, worse, obscures the inherent non-linearity of the relationship between bribe prices and electorate size. A better approach depicts predicted bribe prices as a function of \(\text{ELECTORS}\). The revised graph does so and uses faceting to prevent occlusion. Producing this graphic required a different Stata coding approach. The chapter appendix includes the code, which is worth studying because the technique is useful and broadly applicable.

3. Marginal Effects Plots in the Example

We need to exercise care with marginal effects plots in this case because each of the three specifications has a different mathematical form, and hence a different partial derivative of bribe prices with respect to electors. Let’s work through the first two specifications analytically before we generate any marginal effects plots. This will help us to decide if the result is worth graphing–and inform us as to what the software should produce.

Specification 1

\[ BRIBE_{it} = \beta_1 ELECTORS_{it} + \beta_2 SECRET_{t} + \beta_3 (SECRET_{t} \times ELECTORS_{it}), \]

The partial derivative of \(BRIBE_{it}\) with respect to \(ELECTORS_{it}\) is thus:

\[ \frac{\partial BRIBE_{it}}{\partial ELECTORS_{it}} = \beta_1 + \beta_3 SECRET_t \]

So when \(SECRET_t = 0\), the partial derivative reduces to \(\beta_1\). That’s just the coefficient on \(ELECTORS_{it}\)–and we can read that directly from the regression table! When \(SECRET_t = 1\), the partial derivative is \(\beta_1 + \beta_3\)–but that is just a constant, no matter the value of \(ELECTORS_{it}\). Thus, there is little point in constructing a marginal effects plot for the first specification: it will just show two constants, \(\beta_1\) and \(\beta_1 + \beta_3\). Try it.
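You can confirm this numerically. A sketch in base R with simulated data patterned loosely on the example (the coefficients and names are illustrative, not Kam’s (2017) estimates):

```r
# Sketch: with a binary-by-continuous interaction, the marginal effect
# of electors is b1 when secret = 0 and b1 + b3 when secret = 1.
set.seed(1872)
n <- 500
electors <- runif(n, 1, 100)
secret   <- rbinom(n, 1, 0.5)
bribe    <- 10 - 0.8 * electors + 5 * secret +
            1.5 * secret * electors + rnorm(n)

m <- lm(bribe ~ secret * electors)
b <- coef(m)

me_pre  <- b["electors"]                         # dBRIBE/dELECTORS, SECRET = 0
me_post <- b["electors"] + b["secret:electors"]  # dBRIBE/dELECTORS, SECRET = 1
# Both are single numbers -- a marginal effects "plot" would just be
# two flat lines.
```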

Specification 2

\[ BRIBE_{it} = \beta_1 \sqrt{ELECTORS_{it}} + \beta_2 SECRET_{t} + \beta_3 (SECRET_{t} \times \sqrt{ELECTORS_{it}}) \]

Note well: the partial derivative with respect to \(\sqrt{ELECTORS}\) is exactly as in Specification 1—but that’s not what we want! We need the partial derivative with respect to \(ELECTORS\). This is trickier because:

\[ \frac{\partial \sqrt{x}}{\partial x} = \frac{1}{2\sqrt{x}} \]

Working through the calculus (or using Symbolab), we get:

\[ \frac{\partial BRIBE_{it}}{\partial ELECTORS_{it}} = \frac{\beta_1}{2\sqrt{ELECTORS_{it}}} + \frac{\beta_3 SECRET_t}{2\sqrt{ELECTORS_{it}}} \]

This partial derivative changes with both \(SECRET\) and \(ELECTORS\). The interaction term disappears when \(SECRET = 0\), and both terms decrease as \(ELECTORS\) increases. A marginal effects plot is designed to illustrate exactly these kinds of dynamics.
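Before turning to the software, it helps to evaluate the expression directly. A base-R sketch, plugging in coefficient values close to (but not exactly) the Specification 2 estimates:

```r
# Sketch: d BRIBE / d ELECTORS = (b1 + b3 * SECRET) / (2 * sqrt(ELECTORS)),
# evaluated over a grid of electorate sizes. Illustrative coefficients.
b1 <- -0.22
b3 <-  0.30
electors <- seq(1000, 100000, by = 1000)  # electorate size, raw counts

me_pre  <- b1 / (2 * sqrt(electors))         # SECRET = 0: negative, shrinking
me_post <- (b1 + b3) / (2 * sqrt(electors))  # SECRET = 1: positive, shrinking
# Both effects decay toward zero as the electorate grows -- exactly the
# dynamic a marginal effects plot should display.
```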

However, we must be careful with what we request from the software. If I use Stata’s dydx(varname) option to compute the partial derivative of BRIBE with respect to ELECTORS via the following code:

areg BRIBE i.SECRET##c.sqrtELECTORS, absorb(craigid)                   // Estimate regression
margins, at(sqrtELECTORS=(10(10)100)) by(SECRET) dydx(sqrtELECTORS)     // Estimate margins
marginsplot                                                             // Plot result

Stata will plot the partial derivative with respect to \(\sqrt{ELECTORS}\)—a constant, and thus the same uninformative quantity we obtained in Specification 1. We need the partial derivative with respect to \(ELECTORS\). Instead, I must tell Stata to compute,

\[ \frac{\beta_1}{2\sqrt{ELECTORS_{it}}} + \frac{\beta_3 SECRET_t}{2\sqrt{ELECTORS_{it}}} \]

for a given vector of \(ELECTORS_{it}\) values. The result of my effort is as follows:

The code that I wrote to obtain the correct quantities for this marginal effects plot is in the chapter appendix. It too is worth studying because the technique is broadly applicable.

An aside on log-log regressions

I will not work through Specification 3, which regresses logged bribe prices on logged electorate size. However, it is worth noting that this specification allows one to interpret the coefficients as measuring the relative effect of changes in electorate size on bribe prices. Specifically, the coefficient on electorate size in Specification 3 indicates the percentage change in bribe prices resulting from a one percent change in electorate size. The marginal effect of the interaction can be interpreted just as in Specification 1. That is, before the secret ballot was adopted, bribe prices decreased by \(\beta_1\) percent for every one percent increase in electorate size; after the secret ballot was adopted, they decreased by \(\beta_1 + \beta_3\) percent for every one percent increase in electorate size.
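The elasticity interpretation can be checked with simple arithmetic. A sketch using an illustrative coefficient close to the ln(N Electors) estimate in the table above:

```r
# Sketch: in ln(y) = b0 + b1 * ln(x), a 1% increase in x multiplies y
# by 1.01^b1 -- approximately a b1-percent change in y.
b1 <- -0.70                 # illustrative log-log coefficient
pct_change_y <- (1.01^b1 - 1) * 100
round(pct_change_y, 2)      # ≈ -0.69: y falls about 0.7% per 1% rise in x
```

The exact multiplicative change \(1.01^{\beta_1}\) and the approximation "\(\beta_1\) percent" diverge only for large coefficients or large percentage changes in \(x\), which is why the percent-for-percent reading is standard for log-log models.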

Conclusion

This chapter underscores four key points:

  • Regression tables remain indispensable for conveying precise numerical estimates and enabling replication, even as graphical methods grow in popularity.

  • Thoughtful graphical representations—including predicted values, marginal effects, and coefficient plots—can highlight relationships and interactions more intuitively than raw tables alone. However, not all three types of regression graphics pair well with every regression model, and choosing the right visualization depends on the structure of the estimated model.

  • A mathematical understanding of what one is graphing is essential, especially in nonlinear, interactive regression models. Transformations (e.g., logs, square roots) and careful interpretation of partial derivatives are crucial; without this understanding, one risks mislabeling or misinterpreting slopes and effects.

  • Practical coding strategies—both for efficient table creation and for generating robust graphics—are critical for modern empirical work. Mastering these tools helps ensure that published results are both transparent and accessible to a broad audience.

Appendix: Software Tools and Code

R Code for Regression Tables and Graphics

Using modelsummary() to produce a table of coefficients
  • I use the modelsummary package to produce a flextable object below. Other packages exist to produce regression tables, but not all of them work with every statistical estimator. The stargazer package, for example, does not pair natively with the fixed-effects model that I estimate below. Such limitations can be overcome with some jerry-rigging, but I’ll leave that to folks who are more deeply committed to R than I am.
#load libraries
library(haven)
library(tidyr)
library(lfe)
library(labelled)
library(modelsummary)
library(flextable)

#load data

bribes_and_ballots <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kam - Secret Ballot/CPS Replication/Bribes_and_Ballots_Replication.dta")

# mutate some covariates
bribes_and_ballots <- bribes_and_ballots %>% 
  mutate(
        sqrtELECTORS = sqrt(ELECTORS*1000),
         lnELECTORS = log(ELECTORS*1000)
     ) %>%
  set_variable_labels(
    sqrtELECTORS = "√(N Electors)",
    lnELECTORS = "ln(N Electors)"
     )

#run some regressions
model1 <- felm(BRIBE ~ SECRET * ELECTORS | craigid, data = bribes_and_ballots)
model2 <- felm(BRIBE ~ SECRET * sqrtELECTORS | craigid, data = bribes_and_ballots)
model3 <- felm(lnBRIBE ~ SECRET * lnELECTORS | craigid, data = bribes_and_ballots)

# Create a single table
# My quick tour of libraries suggests modelsummary() is the best choice here
# But there are alternatives, like stargazer or gtsummary
mynicetable <- modelsummary(list("Model 1" = model1, "Model 2" = model2, "Model 3" = model3),
             stars = TRUE,  # Add significance stars
             gof_map = c("r.squared", "nobs"),  # Show R² and N
             output = "flextable") #can be altered to .docx, tex, html...

# Print it
mynicetable
Using broom and ggplot2 to produce a coefficient plot

There are dedicated coefficient plot packages, but one can use the broom package to sweep the coefficients and standard errors from the regression model, and then pass them to ggplot2.

# Load libraries and data from immediately previous chunk if required
# Some additional libraries 

library(broom)     # For extracting coefficients
library(ggplot2)   # For plotting
library(dplyr)     # For data manipulation

# Uncomment if the models not yet estimated
# model1 <- felm(BRIBE ~ SECRET * ELECTORS | craigid, data = bribes_and_ballots)
# model2 <- felm(BRIBE ~ SECRET * sqrtELECTORS | craigid, data = bribes_and_ballots)
# model3 <- felm(lnBRIBE ~ SECRET * lnELECTORS | craigid, data = bribes_and_ballots)

# Extract coefficients & confidence intervals
coef_df <- tidy(model3, conf.int = TRUE)

# Rename terms in coef_df
coef_df <- coef_df %>%
  mutate(term = case_when(
    term == "SECRET:lnELECTORS" ~ "SECRET x ln(ELECTORS)",  # Change label
    term == "SECRET" ~ "SECRET",  # Keep as is (optional)
    term == "lnELECTORS" ~ "ln(ELECTORS)",  # Change another label
    TRUE ~ term  # Keep other terms unchanged
  ))

#see what's in coef_df
print(coef_df)

# Create coefficient plot
ggplot(coef_df, aes(x = estimate, y = term)) +
  geom_point(color = "black", size = 3) +  # Point estimate
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, color = "black") +  # CI bars
  geom_vline(xintercept = 0, linetype = "dashed", color = "pink", linewidth = 0.7) +  # Reference line
  theme_bw() +
  labs(title = "Coefficient Plot for Fixed-Effects Regression Model of Bribe Prices", x = "Estimate", y = "Coefficient")
Using ggeffects to produce a predicted values plot

The next code chunk uses the ggeffects library to produce a predicted values plot. What’s of note here is that ggeffects does not handle fixed-effects regression well. That’s not a problem, however: estimating a linear model with districts included as a vector of dummy variables produces identical estimates for the substantive coefficients of interest. I then suppress the coefficients on the district dummies because they are superfluous.

#Load data and libraries
library(haven)
library(tidyr)
library(dplyr)
library(labelled)
library(ggplot2)
#I use ggeffect to generate vector of predicted values
library(ggeffects)

#load data

bribes_and_ballots <- read_dta("P:/pCloud Drive/Teaching/Teaching 2025/POLI381/Data & Code/Kam - Secret Ballot/CPS Replication/Bribes_and_Ballots_Replication.dta")

# mutate some covariates
bribes_and_ballots <- bribes_and_ballots %>% 
  mutate(
        sqrtELECTORS = sqrt(ELECTORS*1000),
        lnELECTORS = log(ELECTORS*1000),
        SECRET = factor(SECRET, levels=c(0,1), labels=c("Pre-ballot", "Post-ballot"))
     ) %>%
  set_variable_labels(
    sqrtELECTORS = "√(N Electors)",
    lnELECTORS = "ln(N Electors)",
    SECRET = "Secret Ballot"
     )

#run some regressions.
#felm does not play nice with ggeffect -- but I can use lm; I am not showing the FE coefficients
model1_lm <- lm(BRIBE ~ SECRET * ELECTORS + factor(craigid), data = bribes_and_ballots)

# Compute predicted values WITHOUT fixed effects
# Note the ggeffects syntax for representative values: "ELECTORS [0:120 by=10]"
bribes_and_ballots_hat <- ggpredict(model1_lm, terms = c("ELECTORS [0:120 by=10]", "SECRET"))


# Plot the predictions with separate SECRET groups
ggplot(bribes_and_ballots_hat, aes(x = x, y = predicted, color = group, linetype=group)) +
  geom_line(linewidth = 1) +  # Predicted values
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = group), alpha = 0.2, color=NA) +  # Confidence intervals
  theme_minimal() +
  labs(title = "Predicted BRIBE Values by SECRET Ballot Status",
       x = "Number of Electors",
       y = "Predicted BRIBE") +
  scale_color_manual(values = c("black", "black")) +  # Ensure black color
  scale_linetype_manual(values = c("solid", "dotted")) +  # Different line styles
  scale_fill_manual(values = c("gray60", "pink")) +  # Adjust fill colors for visibility
  theme_classic() +
  labs(title = "Predicted BRIBE Values by SECRET Ballot Status",
       x = "Number of Electors",
       y = "Predicted BRIBE",
       color = "Secret Ballot",
       linetype = "Secret Ballot",
       fill = "Secret Ballot") +  # Ensure legend labels match
  theme(
    legend.position = "inside",  # ✅ New argument for ggplot2 3.5.0+
    legend.justification = c(0.05, 0.05),  # ✅ Moves legend near the origin
    legend.background = element_rect(fill = "white", color = "white")  # Adds background for readability
  )

Stata Code for Regression Tables and Graphics

The code to replicate the tables and graphs in this chapter follows below.

Preliminary Processing

This chunk loads the data set and generates additional covariates, including the square root and natural logarithm of the electorate size. These transformations allow us to compare different functional forms of the relationship between electorate size and bribe prices. The areg command is used to estimate three different specifications of the regression model, absorbing district-level fixed effects. Additionally, the code demonstrates how to store regression estimates and add custom statistics, such as the number of district fixed effects, using estimates store and estadd scalar.

If you generate the tables and regressions below in several different sessions, you may need to come back and run this chunk again to load and process the data.

** PRELIMINARY PROCESSING
///////////////////////////////////////


//Load Data
use "Secret ballot replication.dta", clear

//generate some covariates
gen sqrtELECTORS = (ELECTORS*1000)^.5
label var sqrtELECTORS "√(N Electors)"

gen lnELECTORS = ln(ELECTORS*1000)
label var lnELECTORS "ln(N Electors)"

label define secret 0 "Pre-Secret Ballot" 1 "Post-Secret Ballot", add
label values SECRET secret      

// Estimate 3 specifications of a regression model
        *areg is a regression with dummy variables for fixed effects -- here for parliamentary districts (craigid) 
        *The notation i.SECRET##c.ELECTORS tells Stata to interact indicator variable SECRET by continuous ELECTORS

//A first specification
areg BRIBE i.SECRET##c.ELECTORS , abs(craigid)
        
        *Stata stores in a systematic manner a lot of information about each estimation 
        ereturn list, all 

        *we tell Stata to use this stored information to build the table
        *store the estimates in a macro, called m1  -- but you can call this anything
        
        estimates store m1
        
        *store an additional statistic, the N of fixed effects / dummy variables
        estadd scalar N_districts e(k_absorb)

//A second specification

areg BRIBE i.SECRET##c.sqrtELECTORS , abs(craigid)
        estimates store m2
        estadd scalar N_districts e(k_absorb)

//A third specification
    
areg lnBRIBE i.SECRET##c.lnELECTORS , abs(craigid)
        estimates store m3
        estadd scalar N_districts e(k_absorb)

Regression Tables

This section constructs a publication-ready regression table using esttab and collect. The esttab command efficiently formats regression output for export, applying formatting options such as dropping constant terms, adding stars for statistical significance, and including key statistics (adjusted \(R^2\), sample size, and number of districts). The collect framework offers an alternative, allowing more customization, including formatting coefficient estimates, setting column labels, and adding footnotes. These techniques are essential for producing well-structured tables that meet journal formatting standards.

**REGRESSION TABLES:
//////////////////////////////////

        
//Collect results from ** PRELIMINARY PROCESSING ** & feed to esttab
//esttab can output to .tex, .docx, .html, .pdf, or .md files

***run from here--

// Collect results in esttab
// Output to .docx, .html, .tex, or .md files if needed

esttab m1 m2 m3 using "P:\pCloud Drive\Teaching\Teaching 2025\POLI381\Textbook\Figures\regtable.html", replace ///
    drop(_cons) nobase noomitted /// Suppress constant, base levels, omitted variables ///
    cells("b(fmt(2) star)" "se(fmt(2) par )") /// Report coefficients & SEs ///
    onecell /// Put b and se in one cell ///
    star(** 0.05 *** 0.01) /// significance stars ///
    stats(r2_a N N_districts, fmt(%9.2f %9.0f) ///
          labels("Adjusted R-Squared" "N Observations" "N Districts")) /// Report these stats ///
    varwidth(40) /// Allow 40 spaces for variable names ///
    order(ELECTORS 1.SECRET#c.ELECTORS sqrtELECTORS 1.SECRET#c.sqrtELECTORS ///
          lnELECTORS 1.SECRET#c.lnELECTORS) /// Present variables in this order ///
    varlabel(1.SECRET "Secret Ballot" ///
             ELECTORS "N Electors (1000s)" ///
             sqrtELECTORS "√(N Electors)" ///
             lnELECTORS "ln(N Electors)" ///
             1.SECRET#c.ELECTORS "Secret Ballot x N Electors (1000s)" ///
             1.SECRET#c.sqrtELECTORS "Secret Ballot x √(N Electors)" ///
             1.SECRET#c.lnELECTORS "Secret Ballot x ln(N Electors)") /// Use these variable labels ///
    noabbrev interaction(x) wrap /// No abbreviations and wrap long variable names ///
    addnotes(" ** 0.05 *** 0.01" ///
             "Cell entries are unstandardized regression coefficients, with standard errors in parentheses below." ///
             "Column headings identify dependent variable." ///
             "Bribe prices standardized in terms of 1906 average earnings." ///
             "All models include a full set of district dummy variables." ///
             "Secret Ballot is coded 0 for years prior to 1872, and 1 thereafter." ///
             "Source: Kam (2017)") /// Add footnotes to table ///
    title("Table: Regression Estimates of Bribe Prices at Parliamentary Elections, 1820-1906: Before and After the Secret Ballot.") ///
    nonumbers ///
    mtitles("Bribe Price (£)" "Bribe Price (£)" "ln(Bribe Price (£))") /// Column labels ///
    collabels(none)

***--to here.
    
// Using collect to build a regression table

collect clear
collect create regtable

*Specification 1
    collect _r_b _r_se, tag(model[1]): areg BRIBE c.SECRET##c.ELECTORS , abs(craigid)
    
*Specification 2
    collect _r_b _r_se, tag(model[2]): areg BRIBE c.SECRET##c.lnELECTORS , abs(craigid)

*Specification 3
    collect _r_b _r_se, tag(model[3]): areg lnBRIBE c.SECRET##c.lnELECTORS , abs(craigid)

collect dims

*collect layout (colname#result result[r2_a N k_absorb]) (model)    
collect layout (colname[SECRET ELECTORS lnELECTORS SECRET#ELECTORS SECRET#lnELECTORS]#result result[r2_a N k_absorb]) (model)

*you can stop and look at a preview any time
collect preview 

// Adjustments to style of the table
    
    *turn off base levels
    collect style showbase off
  
    *format coefficient and se
    collect style cell, nformat(%5.2f)
    collect stars _r_p 0.01 "***" 0.05 "** ", attach(_r_b) noshownote
    collect style cell result[_r_se], sformat("(%s)")   

    *improve readability
    collect style column, extraspace(1)
    collect style row stack, delimiter(" x ")
    collect style cell border_block, border(right, pattern(nil))
    collect style cell cell_type[item column-header], halign(center)
     
    
    *manipulating appearance of adj r squared N, etc
    collect style header result, level(hide) 
    collect style header result[r2_a N k_absorb], level(label) 
    collect label levels result N "Number Observations", modify 
    collect label levels result k_absorb "Number Districts", modify 
    collect style cell result[N k_absorb], nformat(%5.0f)
    
    *altering variable names -- which collect identifies as levels (rows) of each column
    collect label levels colname ELECTORS "N Electors (1000s)", modify
    collect label levels colname lnELECTORS "ln(N Electors)", modify
    collect label levels colname SECRET "Secret Ballot", modify

    *alter column labels to show DV
    collect label levels model "1" "Bribe Price (£) ", modify
    collect label levels model "2" "Bribe Price (£) ", modify
    collect label levels model "3" "ln(Bribe Price (£))", modify
    
    *Table footnotes
    collect notes : "*** p < .01, ** p < .05"
    collect notes : "Cell entries are unstandardized regression coefficients, with standard errors in parentheses below."
    collect notes : "Column headings identify dependent variable."
    collect notes : "Bribe prices standardized in terms of 1906 average earnings."
    collect notes : "All models include a full set of district dummy variables."
    collect notes : "Secret Ballot is coded 0 for years prior to 1872, and 1 thereafter."
    collect notes : "Source: Kam (2017)"

    collect title "Table: Regression Estimates of Bribe Prices at Parliamentary Elections: Before and After the Secret Ballot."

collect preview

collect export collected_regtable.html, replace

Regression Plots

This section provides code to produce several types of regression plots.

Coefficient Plots in Stata

  • The coefplot command presents estimated coefficients and confidence intervals, allowing quick comparisons between predictors. This is useful when dealing with multiple models and interactions.

//Load data & preprocess if necessary

//Load Data
  use "Secret ballot replication.dta", clear

//generate some covariates
  gen sqrtELECTORS = (ELECTORS*1000)^.5
  label var sqrtELECTORS "√(N Electors)"

  gen lnELECTORS = ln(ELECTORS*1000)
  label var lnELECTORS "ln(N Electors)"

  label define secret 0 "Pre-Secret Ballot" 1 "Post-Secret Ballot", add
  label values SECRET secret        


//Run a regression & store results
  areg lnBRIBE i.SECRET##c.lnELECTORS , abs(craigid)
          estimates store m3
     
// Call results into coefplot to produce a basic coefficient plot
coefplot m3, levels(95) msymbol(o) mfcolor(white) mlcolor(black) mlwidth(thin)   ///
    msize(medium) ciopts(lwidth(medthin) lcolor(black)) xline(0) drop(_cons)     ///
    rename(1.SECRET = "Secret Ballot"                                            ///
           lnELECTORS = "ln(N Electors)"                                         ///
           1.SECRET#c.lnELECTORS = "Secret Ballot x ln(N Electors)")             ///
    xlabel(-10(2)2) title("Figure: A Basic Coefficient Plot")

Predicted Values Plots in Stata

  • The margins and marginsplot commands generate expected bribe prices as a function of electorate size, before and after the secret ballot. The default marginsplot is improved with translucent confidence intervals and better formatting.

  • Using twoway to Overcome marginsplot Limitations: To display results on an alternative x-axis (electorate size instead of its square root), the margins results are saved as a data set and plotted with twoway. This technique provides greater flexibility than marginsplot when visualizing transformations.


// Predicted Values Plot Using margins & marginsplot

//Load & preprocess data as in previous chunk if necessary


//We want to know range of sqrt.ELECTORS so that our predicted values plot is within data range.
        summ sqrtELECTORS, det  // 0-120 covers 95% range

//we must first estimate a statistical model to use margins

  areg BRIBE i.SECRET##c.sqrtELECTORS , abs(craigid)

//Skeletal output of marginsplot
  margins, at(sqrtELECTORS=(0(10)120)) over(SECRET) 
  marginsplot       

//Improving marginsplot
  margins, at(sqrtELECTORS=(0(10)120)) over(SECRET) 
  marginsplot, recastci(rarea)                                            /// 
     plot1opts(lcolor(black) lpattern(dash) lwidth(vthin) mstyle(none))   /// thinner, gray-scale lines, no adornment
     plot2opts(lcolor(black) lpattern(solid) lwidth(vthin) mstyle(none))  ///
     ci1opts(fcolor(gs12%40) lwidth(vvthin))                              ///   shaded and translucent CIs
     ci2opts(fcolor(gs12%50) lwidth(vvthin))                              ///
     ylab(-20(10)30, labsize(medsmall)) ytitle("Bribe Price (£ - 1906 Avg. Earnings)", size(small)) /// add title 
     yline(0, lcolor(reddish) lwidth(medium))                             ///  reference line
       xlab(0(20)120, labsize(medsmall))                                    ///  adjusted range
     legend(position(2) ring(0) cols(1) symysize(small))                  /// shift legend into unused space
    aspect(1) title("Predicted Bribe Prices:" "Before and After the Secret Ballot", size(medsmall))
    
// I cannot superimpose a second x-axis to see the nonlinearity of the function; I need a new technique

//Using twoway to overcome marginsplot limitations  
    //Load Data
     use "Secret ballot replication.dta", clear
         
    //generate some covariates
    gen sqrtELECTORS = (ELECTORS*1000)^.5
    label var sqrtELECTORS "√(N Electors)"
    
    gen lnELECTORS = ln(ELECTORS*1000)
    label var lnELECTORS "ln(N Electors)"
    
    label define secret 0 "Pre-Secret Ballot" 1 "Post-Secret Ballot", add
    label values SECRET secret      
     
  //Estimate a model    
    areg BRIBE i.SECRET##c.sqrtELECTORS, absorb(craigid)
    
    
    //Save output of margins as its own data set
    margins, at(sqrtELECTORS=(0(5)120)) over(SECRET) saving(mymargins, replace)

    //create and switch to a new frame and load the margins data set
    //frame create only needs to be run once
    frame create predictplot
    frame change predictplot

    use mymargins, clear

    //re-name/-create variables
    gen sqrtELECTORS = _at2
    gen ELECTORS = (_at2^2)
    gen SECRET = _by1
    
    label define secret 0 "Pre-Secret Ballot" 1 "Post-Secret Ballot", modify
    label values SECRET secret
    label var SECRET "Secret Ballot Era"

    twoway ///
        (rarea _ci_lb _ci_ub ELECTORS, color(gs12%40) lwidth(vvthin))  ///
        (line _margin ELECTORS, lwidth(medium) lcolor(black)),         ///
        by(                                                            ///
            SECRET, legend(off)                                          ///
            note("Shaded areas represent 95% confidence intervals." "Bribe prices in terms of 1906 Avg. Earnings.", size(small))                                                       ///
            title("Predicted Bribe Prices:" "Before and After the Secret Ballot", size(medsmall)) ///
            )                                                           ///
        xtitle("N Electors")                                          ///
        xlabel(0(5000)15000, format(%9.0fc) labsize(medsmall))        ///
        xmtick(##3)                                                   ///   
        ytitle("Bribe Price (£)")                                     ///
        ylabel(-20(10)20, labsize(medsmall))                          ///
        yline(0, lcolor(reddish))                                     ///
        ymtick(##2)                                                   ///
        plotregion(margin(4 2 0 0)) 
    
    //return to main dataset in the default frame
    frame change default

Marginal Effects Plots in Stata

  • The dydx option in margins is first used incorrectly (derivative with respect to \(\sqrt{ELECTORS}\) instead of \(ELECTORS\)), illustrating the importance of understanding the mathematics behind what is plotted.

  • The correct partial derivative is then manually computed using expression() in margins and plotted using twoway.

  • Re-scaling for Interpretability: The final graph re-scales electorate size to hundreds and adjusts the marginal effect accordingly, improving the substantive interpretation of results.

These techniques—modifying marginsplot, saving margins output for custom plotting, and re-scaling variables—are useful skills for producing clear and interpretable regression graphics.
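The chain-rule algebra behind the expression() call is worth making explicit. Writing \(E\) for ELECTORS, the second specification is linear in \(\sqrt{E}\):

\[ \widehat{BRIBE} = \alpha_d + \beta_1 SECRET + (\beta_2 + \beta_3 SECRET)\sqrt{E} \]

so the marginal effect of an additional elector is

\[ \frac{\partial \widehat{BRIBE}}{\partial E} = (\beta_2 + \beta_3 SECRET) \cdot \frac{1}{2\sqrt{E}} \]

This is the quantity supplied to expression() in the code, and it shows why dydx(sqrtELECTORS) alone is misleading: that option returns \(\beta_2 + \beta_3 SECRET\), the derivative with respect to \(\sqrt{E}\), which omits the \(1/(2\sqrt{E})\) factor that makes the effect shrink as the electorate grows.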


//Load & preprocess data as in previous chunk if necessary
    
// Estimate the regression model
areg BRIBE i.SECRET##c.sqrtELECTORS , abs(craigid)

// option dydx(varname) generates the partial derivative wrt varname--but understand what is computed!
margins, at(sqrtELECTORS=(10(10)100)) over(SECRET) dydx(sqrtELECTORS)
marginsplot, recastci(rarea)                                            ///  This is wrt sqrtELECTORS, not ELECTORS!
    plot1opts(lcolor(black) lpattern(dash) lwidth(vthin) mstyle(none))    /// 
    plot2opts(lcolor(black) lpattern(solid) lwidth(vthin) mstyle(none)) ///
    ci1opts(fcolor(gs12%40) lwidth(vvthin))                              ///  
    ci2opts(fcolor(gs12%50) lwidth(vvthin)) 

// Using expression() to obtain derivative wrt N electors
// I use two-way strategy again

// Load & preprocess data again, if necessary

// Estimate regression
areg BRIBE i.SECRET##c.sqrtELECTORS, absorb(craigid)

// Compute margins and save dataset
margins, at(sqrtELECTORS=(0(5)120)) over(SECRET) ///
    expression((_b[sqrtELECTORS] + _b[1.SECRET#c.sqrtELECTORS] * SECRET) / (2 * sqrtELECTORS)) ///
    saving(mymarginsdydx, replace)

// Switch to new frame for dataset manipulation
capture frame create dydxplot  // capture avoids an error if the frame already exists
frame change dydxplot

use mymarginsdydx, clear

// Rename and compute necessary variables
gen sqrtELECTORS = _at2
gen ELECTORS = sqrtELECTORS^2
gen SECRET = _by1

gen ELECTORS_K    = (sqrtELECTORS^2) / 100  // now in hundreds

* Convert partial derivative and confidence intervals to "per 100 electors"
* this is purely re-scaling, but it conveys the substantive effect better
gen margin_k  = _margin  * 100
gen ci_lb_k   = _ci_lb   * 100
gen ci_ub_k   = _ci_ub   * 100


// Label categorical variable
label define secret 0 "Pre-Secret Ballot" 1 "Post-Secret Ballot", modify
label values SECRET secret
label var SECRET "Secret Ballot Era"

// Plot the marginal effects with correct x-axis scale
twoway                                                                                                   ///
    (rarea ci_lb_k ci_ub_k ELECTORS_K if ci_lb_k >= -2.2 & ci_ub_k < 1.2, color(gs12%40) lwidth(vvthin)) ///
    (line margin_k ELECTORS_K, lwidth(medium) lcolor(black)),                                            ///
    by(SECRET, legend(off)                                                                               ///
        note("Shaded areas represent 95% confidence intervals."                                          ///
             "Bribe prices in terms of 1906 Avg. Earnings.", size(small))                                ///
        title("Marginal Effects of Electorate Size on Bribe Prices" "Conditional on Secret Ballot", size(medsmall)) ///
    )                                                    ///
    xtitle("N Electors (100s)")                          /// x-axis cosmetics
    xlabel(0(30)150, format(%9.0fc) labsize(medsmall))   ///
    xmtick(##6)                                          ///
    ytitle("d(Bribe)/d(Electors)")                       /// y-axis cosmetics
    ylabel(-2(1)1, labsize(medsmall))                    ///
    yline(0, lcolor(reddish))                            ///
    ymtick(##2)                                          ///
    plotregion(margin(4 2 0 0))

frame change default

Conclusion

This book has emphasized the critical importance of conceptual clarity, attention to detail, and iteration in managing and presenting data. First, before collecting and building a data set, researchers must have a well-developed conceptual understanding of their theoretical model and measures. Without this foundation, the resulting data risk being incomplete, misleading, or analytically unusable. Second, simple attention to detail—careful documentation, systematic coding, and clear labeling—ensures that data sets remain interpretable, reproducible, and reliable. Third, nearly every example in this book required multiple iterations: initial attempts were rarely optimal, and refinement and adjustment were necessary at every stage.

In addition to these principles, the book has highlighted the importance of thoughtful graphical presentation. While regression tables remain indispensable for precision and replication, effective visualizations can make patterns and relationships more intuitive. Finally, a mathematical understanding of transformations and interactions is essential for accurately interpreting statistical models and presenting them correctly. Whether handling data or designing figures, a clear conceptual framework, rigorous execution, and iterative improvement remain the cornerstones of effective empirical research.


References

Besley, T. and Burgess, R., 2002. The political economy of government responsiveness: Theory and evidence from India. The Quarterly Journal of Economics, 117(4), pp.1415-1451.

Cleveland, W.S., 1993. Visualizing Data. Hobart Press.

Cleveland, W.S., 1994. The Elements of Graphing Data. Hobart Press.

Jacoby, W.G., 2000. Loess: A nonparametric, graphical tool for depicting relationships between variables. Electoral Studies, 19(4), pp.577-613.

Kam, C., 2014. Enfranchisement, malapportionment, and institutional change in Great Britain, 1832–1868. Legislative Studies Quarterly, 39(4), pp.503-530.

Kam, C., 2017. The secret ballot and the market for votes at 19th-century British elections. Comparative Political Studies, 50(5), pp.594-635.

Kastellec, J.P., 2023. Practical Advice for Producing Better Graphs.

King, G., 1986. How not to lie with statistics: Avoiding common mistakes in quantitative political science. American Journal of Political Science, pp.666-687.

Kostelka, F. and Blais, A., 2021. The generational and institutional sources of the global decline in voter turnout. World Politics, 73(4), pp.629-667.

Schwabish, J., 2021. Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks. Columbia University Press.

Wickham, H. and Grolemund, G., 2017. R for Data Science (Vol. 2). Sebastopol, CA: O'Reilly.